Python Big Data Analysis: Web Crawlers

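\d matches one digit, \D matches one non-digit, \w matches one letter or digit (or an underscore), . matches any single character, * means any number of characters (including zero), and + means at least one character.

? means zero or one character, {n} means exactly n characters, and {n,m} means n to m characters.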
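
To see these metacharacters and quantifiers in action, here is a minimal sketch using the standard re module. The sample patterns and strings are made up for illustration; re.fullmatch is used so that each pattern has to cover the whole sample string.

```python
import re

# Each pair is (pattern, sample text the pattern should match in full).
samples = [
    (r"\d", "7"),            # one digit
    (r"\D", "x"),            # one non-digit
    (r"\w", "a"),            # one letter, digit, or underscore
    (r"py.", "py3"),         # "." matches any single character
    (r"ab*", "a"),           # "*" allows zero occurrences of "b"
    (r"ab+", "abbb"),        # "+" needs at least one "b"
    (r"colou?r", "color"),   # "?" makes the "u" optional
    (r"\d{3}", "010"),       # exactly three digits
    (r"\d{3,8}", "12345"),   # three to eight digits
]

results = {pattern: bool(re.fullmatch(pattern, text)) for pattern, text in samples}
for pattern, matched in results.items():
    print(f"{pattern!r:12} {matched}")

# When a quantifier is not satisfied, the match fails and None is returned.
too_short = re.fullmatch(r"\d{3,8}", "12")
print(too_short)
```

Every pair in the table matches, while the last call returns None because two digits are fewer than the minimum of three.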

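(1) \d{3} matches exactly three digits.

(2) \s+ matches at least one whitespace character, and \S matches any non-whitespace character.

[\s\S]* matches any sequence of characters, including newlines.

(3) \d{3,8} matches three to eight digits.

(4) [0-9a-zA-Z\_] matches one digit, letter, or underscore.

(5) a|b matches a or b.

(6) ^\d matches a string that starts with a digit.

(7) \d$ matches a string that ends with a digit.

(8) On a match object, group() extracts the captured substrings: group(0) is the whole matched string, group(1) is the first captured substring, and so on.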

>>> m.groups()

('010', '12345')

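(9) Regular expression matching is greedy by default; adding a ? after a quantifier switches it to non-greedy matching.

(10) If a regular expression is used repeatedly, it can be precompiled for efficiency, so that later uses match directly without compiling again.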
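
The points in the list above can be tried out together in a short sketch. The phone-number pattern and the sample strings are assumptions chosen to be consistent with the ('010', '12345') output shown earlier; they are not necessarily what the original session used.

```python
import re

# Items (6) and (7): ^ and $ anchor the pattern to the start and the end.
starts_with_digit = bool(re.search(r"^\d", "3 apples"))
ends_with_digit = bool(re.search(r"\d$", "apples 3"))

# Item (5): alternation picks either branch.
either = re.findall(r"cat|dog", "cat and dog")

# Item (10): compile once, then reuse the compiled pattern.
phone = re.compile(r"^(\d{3})-(\d{3,8})$")

# Item (8): parentheses define the groups that group() and groups() return.
m = phone.match("010-12345")
whole, area, number = m.group(0), m.group(1), m.group(2)
print(m.groups())

# Item (9): the greedy \d+ swallows the trailing zeros, the non-greedy \d+? does not.
greedy = re.match(r"^(\d+)(0*)$", "102300").groups()
lazy = re.match(r"^(\d+?)(0*)$", "102300").groups()
print(greedy, lazy)

# Reusing the compiled pattern on several inputs.
reused = [bool(phone.match(s)) for s in ("010-12345", "010 12345", "021-7654321")]
print(reused)
```

Compiling pays off mainly when the same pattern is applied many times, for example once per scraped page or per line of a large file.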

a = "我叫胡胡"

str_gb2312 = a.encode('gb2312')  # convert the str to GB2312-encoded bytes

str_utf8 = str_gb2312.decode('gb2312').encode('utf-8')  # decode first, then re-encode as UTF-8
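
A small sketch of why the decode step matters: the same text occupies a different number of bytes in each encoding, and bytes written in one encoding generally cannot be read back with another. The variable names here are illustrative only.

```python
text = "我叫胡胡"

gb_bytes = text.encode("gb2312")                        # str -> GB2312 bytes
utf8_bytes = gb_bytes.decode("gb2312").encode("utf-8")  # GB2312 bytes -> str -> UTF-8 bytes

# Four characters: two bytes each in GB2312, three bytes each in UTF-8.
print(len(text), len(gb_bytes), len(utf8_bytes))

# Decoding with the matching codec restores the original text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding GB2312 bytes as UTF-8 fails for this string.
try:
    gb_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
print(wrong_codec_failed)
```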

Use the third-party library chardet to detect the encoding.

import chardet

det = chardet.detect(str_gb2312)  # detect() expects bytes, not str

import requests
res = requests.get('')  # put the target URL here
content = res.text  # res.text is the response body decoded to str
print(type(content))

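BeautifulSoup extracts data from HTML or XML files; its main use is pulling data out of web pages.

First install the library with pip install beautifulsoup4, then install an HTML parser with pip install lxml. The code below continues from the example above.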

from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'lxml')
print(soup.prettify())  # print the content formatted as HTML
# find every <a> tag (link) in the document
for a in soup.find_all('a'):
    print('attrs: ', a.attrs)  # the attributes of the <a> tag
    print('string: ', a.string)  # the string inside the <a> tag
    print('---------------')
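# the attrs parameter takes a dictionary and finds tags that carry those attributes
for tag in soup.find_all(attrs={...}):  # replace ... with the attribute pairs to match
    print('tag: ', tag.name)
    print('attrs: ', tag.attrs)
    print('string: ', tag.string)
    print('---------------')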
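# find the <a> tags whose text is 教育部 (Ministry of Education)
# string= is the current name of the argument formerly called text=
for tag in soup.find_all(name='a', string='教育部'):
    print('tag: ', tag.name)
    print('attrs: ', tag.attrs)
    print('string: ', tag.string)
    print('---------------')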
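import re
# attribute values can also be matched with a compiled regular expression
# fill in the attribute name and the start of the pattern in place of ...
for tag in soup.find_all(attrs={...: re.compile(r'...-\w')}):
    print(tag)
    print('---------------')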
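
Since the searches above depend on a live page, here is a self-contained sketch that runs the same kinds of find_all queries against a small inline document. The HTML, the class name nav, and the URLs are all hypothetical, and html.parser (bundled with Python) is used in place of lxml so that only beautifulsoup4 needs to be installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
html = """
<html><body>
  <a class="nav" href="/home">Home</a>
  <a class="nav" href="/news-2024">News</a>
  <a id="edu" href="http://example.com/moe">教育部</a>
  <p class="nav">Not a link</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <a> tag in the document.
all_links = [a["href"] for a in soup.find_all("a")]

# attrs takes a dict; any tag whose class includes "nav" is returned, not only <a>.
nav_tags = [tag.name for tag in soup.find_all(attrs={"class": "nav"})]

# Match on the tag's text.
edu_links = [a["href"] for a in soup.find_all("a", string="教育部")]

# A compiled regular expression can stand in for an attribute value.
dashed = [a["href"] for a in soup.find_all(attrs={"href": re.compile(r"-\d+$")})]

print(all_links, nav_tags, edu_links, dashed, sep="\n")
```

Note that the attrs search is not limited to links: the p element is returned too because it carries the same class.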

(1) Save to a CSV file

csv = """
id,name,score
1,xiaohua,23
2,xiaoming,67
3,xiaogang,89
"""
with open(r"d:/1.csv", 'w') as f:
    f.write(csv)
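
Writing a hand-built string works for fixed data, but scraped fields often contain commas or quotes. As an alternative sketch (not the approach used above), the standard csv module handles quoting automatically. An in-memory buffer stands in for a real file here so the example runs anywhere; with a real file you would pass open(path, "w", newline="", encoding="utf-8") instead.

```python
import csv
import io

rows = [
    {"id": 1, "name": "xiaohua", "score": 23},
    {"id": 2, "name": "xiaoming", "score": 67},
    {"id": 3, "name": "xiaogang", "score": 89},
]

buffer = io.StringIO()  # stand-in for a file opened in text mode
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "score"])
writer.writeheader()     # id,name,score
writer.writerows(rows)   # one line per dictionary

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back: every value comes back as a string.
read_back = list(csv.DictReader(io.StringIO(csv_text)))
print(read_back[1]["name"], read_back[0]["score"])
```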

(2) Save to a database

Scrape the Douban Movie Top 250 data.

import requests
from bs4 import BeautifulSoup
import re

# fetch the page source and build a soup object
def getsoup(url, headers):
    res = requests.get(url, headers=headers)
    return BeautifulSoup(res.text, 'lxml')

# parse the data
def getdata(soup):
    data = []
    ol = soup.find('ol', attrs={...})  # replace ... with the attributes of the list element
    for li in ol.find_all('li'):
        tep = []
        titles = []
        for span in li.find_all('span'):
            if span.has_attr('class'):
                if span.attrs['class'][0] == 'title':
                    ...  # get the movie title
                elif span.attrs['class'][0] == 'rating_num':
                    ...  # get the rating
                elif span.attrs['class'][0] == 'inq':
                    ...
        tep.insert(0, titles)
        print(tep)
        print("-------------")
    print(data)
    print("**********===")
    return data

# get the link to the next page, or None on the last page
def nexturl(soup):
    # Douban's pages are in Simplified Chinese, so the link text starts with 后页
    a = soup.find('a', string=re.compile('^后页'))
    if a:
        return a.attrs['href']
    else:
        return None

if __name__ == '__main__':
    headers = {...}  # request headers; fill in before running
    url = ""  # start URL; fill in before running
    soup = getsoup(url, headers)
    data = getdata(soup)
    print(data)
    nt = nexturl(soup)
    while nt:
        soup = getsoup(url + nt, headers)
        print(getdata(soup))
        nt = nexturl(soup)
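
The heading of this part mentions saving to a database, but the listing stops at printing the scraped rows. One possible way to finish the job is sketched below with the standard sqlite3 module. The table name movie, the column layout (title, rating, quote), and the placeholder rows are all assumptions made for illustration; they are not real scraped data and not part of the original script.

```python
import sqlite3

# Placeholder rows in (title, rating, quote) form, standing in for scraped data.
movies = [
    ("Movie A", 9.5, "First placeholder quote"),
    ("Movie B", 9.1, "Second placeholder quote"),
    ("Movie C", 8.8, None),  # some entries may have no quote
]

# ":memory:" keeps the sketch self-contained; pass a file path to persist the data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS movie ("
    "title TEXT PRIMARY KEY, rating REAL, quote TEXT)"
)

# "with conn" commits on success and rolls back if an insert raises.
with conn:
    conn.executemany(
        "INSERT OR REPLACE INTO movie (title, rating, quote) VALUES (?, ?, ?)",
        movies,
    )

top = conn.execute(
    "SELECT title, rating FROM movie ORDER BY rating DESC LIMIT 2"
).fetchall()
count = conn.execute("SELECT COUNT(*) FROM movie").fetchone()[0]
conn.close()

print(top, count)
```

Using ? placeholders rather than string formatting keeps quotes inside scraped text from breaking the SQL statement, and INSERT OR REPLACE lets the crawler be rerun without creating duplicate rows.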
