Python Web Scraping Study Notes (1)

(1) Three ways to scrape a web page

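1. Regular expressions:

The module is written in C and is fast, but the approach is brittle: a pattern may stop matching as soon as the page is updated.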
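To make the brittleness concrete, here is a minimal sketch using the standard-library re module. The HTML fragments are made up for illustration and are not taken from any real page; the class name bde_image is borrowed from the lxml example later in these notes.

```python
import re

# Hypothetical HTML fragment, used only for illustration.
html = '<div><img class="bde_image" src="a.jpg"><img class="other" src="b.jpg"></div>'

# Match the src of <img> tags whose class attribute comes right before src.
pattern = re.compile(r'<img class="bde_image" src="(.*?)"')
srcs = pattern.findall(html)
print(srcs)  # ['a.jpg']

# The same image after a harmless markup change (attribute order swapped).
changed = '<div><img src="a.jpg" class="bde_image"></div>'
changed_srcs = pattern.findall(changed)
print(changed_srcs)  # []
```

The second search returns nothing even though the image is still on the page, which is the kind of breakage a page update can cause.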

2. Beautiful Soup

The module is written in Python and is slow.

Install: pip install beautifulsoup4
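A minimal sketch of the same extraction with Beautiful Soup might look like this. The HTML string is a made-up example, and the built-in 'html.parser' backend is chosen here only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only for illustration.
html = '<div><img src="a.jpg" class="bde_image"><img class="other" src="b.jpg"></div>'

# Parse with Python's built-in HTML parser.
soup = BeautifulSoup(html, 'html.parser')

# Find every <img> with class "bde_image" and read its src attribute.
srcs = [img['src'] for img in soup.find_all('img', class_='bde_image')]
print(srcs)  # ['a.jpg']
```

Because the document is parsed into a tree first, attribute order and whitespace do not affect the result.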

3. lxml

The module is written in C and is both fast and robust, so it is usually the best choice.

(2) Installing lxml

pip install lxml

To use lxml's CSS selectors, the following module must be installed as well:

pip install cssselect

(3) lxml usage example

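import urllib.error
import urllib.request

import lxml.html

# Download a web page and return its HTML
def download(url, user_agent='socrates', num=2):
    print('Downloading: ' + url)
    # Set the user agent
    headers = {'User-Agent': user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        # Download the page
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print('Download failed: ' + str(e.reason))
        html = None
        if num > 0:
            # On a 5xx error, call download() recursively to retry, at most 2 times
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, num - 1)
    return html

# The URL is left empty in the original notes
html = download('')

# Parse the HTML into a uniform tree structure
tree = lxml.html.fromstring(html)

# img = tree.cssselect('img.bde_image')

# Use lxml's XPath support to get the src attribute values; returns a list
img = tree.xpath('//img[@class="bde_image"]/@src')

x = 0

# Iterate over the list img and save each image in the current directory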
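The notes break off before the loop that saves the images, so the following is only a sketch of how that step might look, not the original author's code. The helper names fetch_bytes and save_images, the counter-based file names, and the .jpg extension are all assumptions made for illustration. The demo at the end uses a fake fetch function so the block runs without network access.

```python
import os
import tempfile
import urllib.request

def fetch_bytes(url):
    """Download url and return the raw response body (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read()

def save_images(urls, directory='.', fetch=fetch_bytes):
    """Save each URL's content as 0.jpg, 1.jpg, ... and return the file paths.

    The .jpg extension is an assumption; real pages may serve other formats.
    """
    paths = []
    for x, url in enumerate(urls):
        path = os.path.join(directory, '%d.jpg' % x)
        with open(path, 'wb') as f:
            f.write(fetch(url))
        paths.append(path)
    return paths

# Offline demo: a fake fetch function stands in for the real download.
demo_dir = tempfile.mkdtemp()
demo_paths = save_images(['u1', 'u2'], directory=demo_dir,
                         fetch=lambda url: url.encode())
print([os.path.basename(p) for p in demo_paths])  # ['0.jpg', '1.jpg']
```

With the list from the XPath query, the call would be save_images(img), which writes the files into the current directory using the default fetch_bytes.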
