Two things to watch for when handling encoding with Python urllib2

urllib2 can fetch web pages. To imitate a browser, the request needs the extra headers that a browser would send.

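Pass the headers to the request as a dict. Because those headers ask for gzip, the response has to be decompressed before use; the alternative is simply not to request gzip over HTTP.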
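As a rough sketch of those two options, the helper below builds the header dict either with or without the gzip request. The function name build_headers and every header value in it are illustrative placeholders, not the headers used for the original request.

```python
def build_headers(accept_gzip=True):
    """Build a dict of browser-like request headers (all values are placeholders)."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (placeholder browser string)',
        'Accept': 'text/html,application/xhtml+xml',
    }
    if accept_gzip:
        # Option 1: advertise gzip; the response body must then be decompressed by hand.
        headers['Accept-Encoding'] = 'gzip'
    # Option 2: leave Accept-Encoding out; a server will then normally reply uncompressed.
    return headers


with_gzip = build_headers()
without_gzip = build_headers(accept_gzip=False)
print(sorted(with_gzip))
print('Accept-Encoding' in without_gzip)
```

Either dict can be passed as the headers argument when the request object is created.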

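import gzip
import urllib2
from StringIO import StringIO

# url is the page to fetch; headers is the dict of browser-like request headers
req = urllib2.Request(url, headers=headers)
resp = urllib2.urlopen(req)
content = ''

# Handle gzip compression.
# Note: because the request imitates Chrome, the response body comes back
# gzip-compressed, and urllib2 does not decompress it automatically.
# StringIO and gzip are needed to get the decompressed string.
# Otherwise decoding fails with:
# UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte
if resp.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(resp.read())
    f = gzip.GzipFile(fileobj=buf)
    content = f.read()
else:
    content = resp.read()

# Decode to unicode using the charset the page actually declares.
# If Content-Type carries no charset, fall back to utf-8 (an assumption; adjust per site).
content_type = resp.headers.get('Content-Type', '')
encoding = content_type.split('charset=')[-1] if 'charset=' in content_type else 'utf-8'
ucontent = unicode(content, encoding)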
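The same two steps (decompress, then decode with the declared charset) can be tried without any network access. The sketch below is written for Python 3, where io.BytesIO takes the place of StringIO. The helper names decompress_body, charset_from_content_type and decode_body are made up for illustration, and the utf-8 default is an assumption.

```python
import gzip
import io


def decompress_body(raw, content_encoding):
    """Return the body bytes, gunzipping them when Content-Encoding says gzip."""
    if (content_encoding or '').strip().lower() == 'gzip':
        with gzip.GzipFile(fileobj=io.BytesIO(raw)) as f:
            return f.read()
    return raw


def charset_from_content_type(content_type, default='utf-8'):
    """Pull the charset parameter out of a Content-Type value, or return the default."""
    for part in (content_type or '').split(';')[1:]:
        name, _, value = part.strip().partition('=')
        if name.lower() == 'charset' and value:
            return value.strip('"\'')
    return default


def decode_body(raw, content_encoding, content_type):
    """Decompress if needed, then decode to text with the declared charset."""
    body = decompress_body(raw, content_encoding)
    return body.decode(charset_from_content_type(content_type))


# Simulated response: a gzip-compressed UTF-8 page.
page = u'<html>\u7de8\u78bc</html>'
raw = gzip.compress(page.encode('utf-8'))

# gzip data starts with the bytes 0x1f 0x8b; the 0x8b at position 1 is
# what a direct utf-8 decode of the compressed body trips over.
print(raw[:2] == b'\x1f\x8b')
print(decode_body(raw, 'gzip', 'text/html; charset=utf-8') == page)
```

Parsing the Content-Type parameters one by one, rather than splitting on 'charset=', keeps the lookup from returning the whole header value when no charset is present.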
