Python urllib: wrong encoding when parsing a web page

Problem description: when parsing web pages with urllib, the detected encoding is sometimes not the encoding the page itself declares.

Working toward a solution

Trying to detect the encoding

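import chardet
import urllib2  # Python 2; in Python 3 the equivalent module is urllib.request

req = urllib2.Request(url)            # url: address of the page to fetch
data = urllib2.urlopen(req).read()    # raw bytes of the response body
det = chardet.detect(data)            # dict with 'encoding' and 'confidence' keys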
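Since the detection result is only a guess, it can help to look at its confidence before trusting it. Below is a minimal sketch of that idea. The helper name pick_encoding, the 'gbk' fallback and the 0.9 threshold are all assumptions made for illustration, not part of chardet; the sample dicts only imitate the shape of what chardet.detect returns.

```python
def pick_encoding(det, fallback="gbk", min_confidence=0.9):
    """Return the detected encoding, or the fallback when detection looks unreliable.

    det is a dict shaped like the result of chardet.detect().
    fallback and min_confidence are illustrative assumptions.
    """
    encoding = det.get("encoding")
    confidence = det.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        return fallback
    return encoding


# Hand-written sample results (not real chardet output).
print(pick_encoding({"encoding": "GB2312", "confidence": 0.99}))
print(pick_encoding({"encoding": "ISO-8859-2", "confidence": 0.42}))
print(pick_encoding({"encoding": None, "confidence": 0.0}))
```

The fallback only makes sense when you already know which encoding the site normally uses, as is the case for the GBK pages discussed here.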

Answer

After some searching I found a fix, and it worked when I tried it:

data = urllib2.urlopen(req).read().decode('gbk','ignore').encode('utf-8')

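Cause

I have now found the root of the problem: illegal characters are mixed into the HTML, which causes chardet.detect(data) to misidentify the encoding.

Calling decode('gbk', 'ignore').encode('utf-8') directly should solve it, because 'ignore' drops the bytes that cannot be decoded instead of raising an error. This assumes the page really is GBK-encoded.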
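To see the effect of 'ignore' without fetching a real page, here is a small sketch written for Python 3 (the code above is Python 2). The sample text and the stray byte 0xFF are made up for illustration; they just stand in for a GBK page with an illegal character in it.

```python
# Python 3 sketch; the sample bytes are invented for illustration.
original = "编码测试"
good = original.encode("gbk")            # valid GBK: 2 bytes per character

# Splice a stray byte between the 2nd and 3rd characters to imitate
# an illegal character mixed into the HTML.
data = good[:4] + b"\xff" + good[4:]

# A strict decode fails on the illegal byte.
try:
    data.decode("gbk")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With 'ignore' the undecodable byte is dropped and the rest survives.
text = data.decode("gbk", "ignore")
utf8_bytes = text.encode("utf-8")        # re-encode as UTF-8, as in the fix above

print(strict_failed, text == original)
```

Note that 'ignore' silently discards data, so it is best used when the encoding is already known and only a few bad bytes are expected.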
