Python Web Scraping for Beginners (Part 1)

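Today I learned how to write a web crawler in Python. It was a lot of fun, so I am writing this post to record what I learned.

The most basic crawler needs only three libraries: urllib, re, and chardet.

urllib is Python's built-in library for handling network requests. For a basic crawler we only need one of its submodules, urllib.request.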

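The function we need from urllib.request

urllib.request.urlopen(url) opens the given address and, for HTTP and HTTPS addresses, returns an http.client.HTTPResponse object.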

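re is the regular expression library. It does pattern matching on strings, which is how we pick out the parts of the page we want.

chardet is a library for detecting a page's character encoding. The chardet.detect() function returns the encoding the page uses.

First use urllib to fetch the page, which gives you its raw bytes. Use chardet to detect the encoding of those bytes, then decode them with that encoding to get the full text of the page. Next, find the information you need in the original page and analyze the pattern around it, extract it with a regular expression, and finally write the results to a file.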

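The signature of urlopen() is

urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)

url is the address of the web page (as a string)

data is the request body to send; passing it turns the request into a POST, and the default None is usually fine

timeout is how many seconds to wait before the request gives up

The other parameters rarely need changing, so the defaults are fine. (cafile, capath, and cadefault were removed in Python 3.13; use context instead.)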
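
As a minimal sketch of the call, the helper below wraps urlopen() with a timeout and returns the raw bytes. The name fetch_bytes and the 5-second default are illustrative choices, not something from the steps above. The demo uses a data: URL, which urllib also understands, so it runs without network access; for an http or https address the call looks the same.

```python
from urllib.request import urlopen


def fetch_bytes(url, timeout=5):
    """Open url and return the response body as bytes (hypothetical helper)."""
    # The with-block closes the response even if read() raises.
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# A data: URL carries its content inline, so no network is needed here.
demo_url = "data:text/plain;charset=utf-8,hello"
body = fetch_bytes(demo_url)
print(type(body).__name__, body)
```

With a real site, a short timeout such as the 2 seconds used later in this post can raise an error on a slow connection, so it may be worth catching urllib.error.URLError around the call.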

Calling read() on the http.client.HTTPResponse object that urlopen() returns gives the page content as bytes.

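chardet.detect() returns a dictionary

In it, encoding is the detected encoding and confidence is how sure chardet is of that guess (a value from 0 to 1)

Then call decode() on the bytes with that encoding to get the text of the whole page.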
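
The detect-then-decode step might look like the sketch below. decode_page is a hypothetical helper name. The fallback to UTF-8 when chardet is not installed or returns no guess, and the use of errors="replace", are assumptions added here for robustness rather than part of the original steps.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:  # assumption: degrade gracefully if it is not installed
    chardet = None


def decode_page(raw, fallback="utf-8"):
    """Decode raw page bytes with the encoding chardet guesses (hypothetical helper)."""
    encoding = None
    if chardet is not None:
        guess = chardet.detect(raw)   # dict with 'encoding' and 'confidence' keys
        encoding = guess["encoding"]  # may be None when chardet cannot decide
    # errors="replace" keeps going if a few bytes do not fit the chosen encoding.
    return raw.decode(encoding or fallback, errors="replace")


text = decode_page(b"<title>hello</title>")
print(text)
```

Detection is a statistical guess, so on very short inputs it is worth checking the confidence value before trusting the result.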

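The most commonly used functions in re are match(pattern, string), search(pattern, string), and findall(pattern, string)

match only matches at the beginning of the string, search scans the string and returns the first match it finds, and findall returns every match in the whole text.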
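
A short sketch of how the three functions differ, run on a made-up sample string:

```python
import re

text = "price: 12 yuan, sale: 8 yuan"

# match() only succeeds if the pattern matches at the very start of the string.
at_start = re.match(r"\d+", text)          # None: the string starts with "price"
prefix = re.match(r"price", text).group()  # 'price'

# search() scans forward and returns the first match anywhere in the string.
first = re.search(r"\d+", text).group()    # '12'

# findall() returns every non-overlapping match as a list of strings.
every = re.findall(r"\d+", text)           # ['12', '8']

print(at_start, prefix, first, every)
```

For scraping, findall is usually the one to reach for, because a page tends to repeat the same pattern once per item.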

Here is a small example

It scrapes the publisher names and the number of books each publisher has on sale from Douban Books

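from urllib.request import urlopen
import re

import chardet  # third-party: pip install chardet


class Publish(object):
    def getinfo(self, address):
        # Download the page as raw bytes.
        response = urlopen(address, timeout=2).read()
        # Guess the encoding, then decode the bytes into text.
        char = chardet.detect(response)
        data = response.decode(char['encoding'] or 'utf-8')
        # NOTE: the HTML tags around these two patterns were stripped from the
        # original post. The <div class="..."> wrappers below are assumed
        # placeholders; replace them with the markup of the real page.
        # "部作品在售" is the page's text for "works on sale".
        pattern1 = r'<div class="name">(.*?)</div>'
        pattern2 = r'<div class="works-num">(.*?) 部作品在售</div>'
        result1 = re.compile(pattern1).findall(data)
        result2 = re.compile(pattern2).findall(data)
        return [result1, result2]

    def writetxt(self, address, filename):
        names, counts = self.getinfo(address)
        with open(filename, 'w', encoding='utf-8') as f:
            # One line per publisher: index, name, number of books on sale.
            for i, (name, count) in enumerate(zip(names, counts), start=1):
                f.write(str(i) + '\t' + name + '\t' + count + '\n')


if __name__ == '__main__':
    publisher = Publish()
    filename = 'publish.txt'
    address = ''  # the URL was left blank in the original post; fill it in
    publisher.writetxt(address, filename)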
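
The extraction step can be tried without touching the network by running the same kind of pattern over a hard-coded snippet. The sample HTML, its class names, and the helper name extract_rows are all made up for illustration; the real page's markup has to be checked in a browser first.

```python
import re

# Made-up sample markup; the real page's tags and class names may differ.
SAMPLE_HTML = """
<div class="name">Publisher A</div><div class="works-num">12 部作品在售</div>
<div class="name">Publisher B</div><div class="works-num">7 部作品在售</div>
"""


def extract_rows(html):
    """Return (publisher, count) pairs found in html (hypothetical helper)."""
    names = re.findall(r'<div class="name">(.*?)</div>', html)
    counts = re.findall(r'<div class="works-num">(\d+) 部作品在售</div>', html)
    # zip() pairs the two lists and stops at the shorter one.
    return list(zip(names, counts))


for index, (name, count) in enumerate(extract_rows(SAMPLE_HTML), start=1):
    print(f"{index}\t{name}\t{count}")
```

Checking the patterns this way first makes it easier to tell a wrong regular expression apart from a failed download.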
