Detecting character encoding in Python

In most Python 2 scripts, you need to add this first:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

To open a file that uses a different encoding, use codecs.open.

The code below reads a file and collects the contents of each line into a list.

import codecs
# Open test.txt for reading and decode it as UTF-8
with codecs.open('test.txt', 'r', 'utf-8') as f:
    lines = [line.strip() for line in f]
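As a small illustration of reading a file that is not UTF-8, the sketch below writes a GBK-encoded file and reads it back with codecs.open. The helper name `read_lines`, the temporary file `demo_gbk.txt`, and its contents are made up for the example.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def read_lines(path, encoding):
    """Return the stripped lines of `path`, decoded with `encoding`."""
    with codecs.open(path, 'r', encoding) as f:
        return [line.strip() for line in f]


# Hypothetical demo file: two lines of Chinese text stored as GBK.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_gbk.txt')
with codecs.open(demo_path, 'w', 'gbk') as f:
    f.write(u'第一行\n第二行\n')

# Passing the matching encoding gives back the original text.
lines = read_lines(demo_path, 'gbk')
```

Reading the same file with 'utf-8' instead of 'gbk' would raise a UnicodeDecodeError, which is why the encoding has to be known or detected first.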

When we don't know a file's encoding, how can a program detect it?

Use the chardet module; the encoding it reports can then be passed to codecs.

import chardet
import urllib
# Choose a different data source as needed
testdata = urllib.urlopen('').read()
print chardet.detect(testdata)
Output:

Reference: this also covers how to detect the encoding of a web page.
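One way chardet and codecs might be combined is sketched below. The function name `read_lines_detected`, its `detect` and `sample_size` parameters, and the UTF-8 fallback are illustrative choices, not part of the original post. The sketch assumes chardet is installed; the detector is passed in as a parameter only so the function can also be tried with a stand-in.

```python
import codecs


def read_lines_detected(path, detect=None, sample_size=4096):
    """Guess the encoding of `path` from a byte sample, then decode the file."""
    if detect is None:
        # Assumption: chardet is installed. chardet.detect takes bytes and
        # returns a dict whose 'encoding' entry may be None.
        import chardet
        detect = chardet.detect

    # Detection works on raw bytes, so read the sample in binary mode.
    with open(path, 'rb') as raw:
        sample = raw.read(sample_size)

    # Falling back to UTF-8 when nothing is detected is an assumption.
    encoding = detect(sample)['encoding'] or 'utf-8'

    with codecs.open(path, 'r', encoding) as f:
        return [line.strip() for line in f]
```

Calling `read_lines_detected('test.txt')` would then return the decoded lines without the caller naming an encoding. Detection is a statistical guess, so short or ambiguous samples can still be misidentified.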

For a large file, it is enough to read only the first few lines.
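A minimal sketch of taking such a sample: the helper `head_bytes`, its `max_lines` default, and the demo file are hypothetical. The file is opened in binary mode because the encoding is not yet known.

```python
import os
import tempfile
from itertools import islice


def head_bytes(path, max_lines=20):
    """Return the first `max_lines` lines of `path` as raw bytes."""
    # Binary mode splits lines on the b'\n' byte, which suits
    # ASCII-compatible encodings such as UTF-8 or GBK.
    with open(path, 'rb') as f:
        return b''.join(islice(f, max_lines))


# Hypothetical demo file with 1000 lines; only the first 3 are read back.
demo_path = os.path.join(tempfile.mkdtemp(), 'big.txt')
with open(demo_path, 'wb') as f:
    for i in range(1000):
        f.write(('line %d\n' % i).encode('utf-8'))

sample = head_bytes(demo_path, max_lines=3)
```

The returned bytes can be passed to chardet.detect in place of the whole file. A sample that happens to contain only ASCII says little about the rest of the file, so a larger `max_lines` may be needed in practice.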

Converting text in this escaped format into normal characters:

a = u"\u5973\u7ae5\u8f8d\u5b66\u7167\u987e\u75c5\u7236"
print a
a = '\u559c\u6b22\u4e00\u4e2a\u4eba'
print a.decode('raw_unicode_escape')

/usr/bin/python2.7 /home/dahu/myfile/my_git/core-scrapy-learning/toutiao/toutiao/t1.py

女童辍学照顾病父

喜欢一个人

process finished with exit code
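The same conversion can be sketched in a form that also runs on Python 3, where a str has no decode method. The helper name `unescape` is made up for the example, and it assumes the input contains only ASCII characters plus literal backslash-u escapes.

```python
# -*- coding: utf-8 -*-


def unescape(text):
    """Turn literal backslash-u escapes in an ASCII-only string into characters."""
    # Encode to bytes first, then let the raw_unicode_escape codec
    # interpret each \uXXXX sequence.
    return text.encode('ascii').decode('raw_unicode_escape')


# Raw string: the backslashes are literal, as in the byte string above.
escaped = r'\u559c\u6b22\u4e00\u4e2a\u4eba'
decoded = unescape(escaped)
```

Here `decoded` holds the five characters 喜欢一个人, matching the second line of output above.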
