Python Encoding Summary

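A character is an abstract concept. In Python, text strings are represented as unicode; the familiar plain-quoted form such as 'abcd' is what I call a byte string (byte sequence) here, to tell the two apart. Faced with a run of bytes whose encoding we do not know, we have no way to interpret the sequence.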

In this article, str denotes a byte string and uni denotes a unicode string.

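Python has many string-handling functions, and almost all of them accept both byte strings and unicode strings. When a byte string is passed to a method of a unicode string (or the two types are otherwise mixed), Python first decodes the bytes to unicode with the default ascii codec and then applies the function. So if the byte sequence contains any byte with a value above 127, decoding fails: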

>>> s = u'it was'          # assumed for illustration: s is any unicode string
>>> s.find('was\x9f')
Traceback (most recent call last):
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
>>> s.find(u'was\x9f')
-1
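
The implicit step that fails above can also be written out explicitly. The following is a minimal sketch, not part of the original session: it decodes the same bytes with the ascii codec and reports where decoding breaks. The helper name `describe_decode_failure` is hypothetical, and the `b''` prefix is used so the snippet runs on both Python 2.7 and Python 3.

```python
def describe_decode_failure(data, encoding='ascii'):
    """Try to decode `data`.

    Returns None if decoding succeeds, otherwise a tuple
    (value_of_offending_byte, position_of_offending_byte).
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte the codec could not decode
        offending = bytearray(data)[err.start]
        return offending, err.start
    return None


# Same bytes as in the session above: 0x9f sits at index 3
print(describe_decode_failure(b'was\x9f'))
# Pure ASCII bytes decode without trouble
print(describe_decode_failure(b'was'))
```

Passing an explicit unicode argument, as in the second call of the session above, avoids the hidden ascii decode altogether.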

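So how do you tell whether a string is a byte sequence or unicode? By appearance: any string literal without a 'u' before the quotes is a byte string, and one with a 'u' is a unicode string.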

In [1]: t = '中文'
In [2]: type(t)
Out[2]: str
In [3]: t = u'中文'
In [4]: type(t)
Out[4]: unicode

In code, you can check it like this:

In [5]: def is_unicode(string):
   ...:     return isinstance(string, unicode)
   ...:
In [6]: is_unicode('中文')
Out[6]: False
In [7]: is_unicode(u'中文')
Out[7]: True

Inside a Python program, text is handled as unicode, so before storing or transmitting it we have to encode it into a byte sequence suitable for storage or transmission. The typical flow is:

byte sequence read from a file or the keyboard ---(decode)---> unicode string for use inside the Python program ---(encode)---> byte sequence suitable for storage or transmission

In symbols:

str -> decode('the_coding_of_str') -> unicode # decoding a byte string yields unicode

unicode -> encode('the_coding_you_want') -> str # encoding unicode yields a byte string
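
As a concrete illustration of the two arrows above, here is a small sketch using UTF-8 as the assumed encoding (the original does not name one). The characters are written as escapes and the `b''`/`u''` prefixes are used so that the snippet runs unchanged on Python 2.7 and Python 3.

```python
# UTF-8 bytes of the two characters U+4E2D U+6587 ("中文"), e.g. as read from a file
raw = b'\xe4\xb8\xad\xe6\x96\x87'

# decode: byte string -> unicode string
text = raw.decode('utf-8')
assert text == u'\u4e2d\u6587'

# The byte string has six bytes, the unicode string two characters
print(len(raw), len(text))

# encode: unicode string -> byte string, ready to be stored or sent
out = text.encode('utf-8')
assert out == raw
```

The length difference is the reason to work on the unicode form inside the program: indexing and slicing then operate on characters rather than on fragments of a multi-byte sequence.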

Converting between two encodings goes through a unicode string as the intermediate form:

str.decode('the_coding_of_str').encode('the_preferred_coding_of_str')
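
A minimal sketch of this conversion, assuming GBK as the source encoding and UTF-8 as the target purely for illustration; `transcode` is a hypothetical helper name, not something defined in the original.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string, using unicode as the intermediate form."""
    # bytes --decode--> unicode --encode--> bytes
    return data.decode(source_encoding).encode(target_encoding)


# GBK bytes of U+4E2D U+6587 ("中文")
gbk_bytes = b'\xd6\xd0\xce\xc4'
utf8_bytes = transcode(gbk_bytes, 'gbk', 'utf-8')

# Both byte strings stand for the same two characters
assert utf8_bytes.decode('utf-8') == gbk_bytes.decode('gbk')
print(len(gbk_bytes), len(utf8_bytes))
```

Note that the source encoding has to be known; decoding with the wrong codec either raises an error or silently produces the wrong characters.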

A note on using encode and decode: calling encode on a byte string, or decode on a unicode string, triggers an implicit conversion:

str.encode('***') -----> str.decode('ascii').encode('***')

uni.decode('***') ------> uni.encode('ascii').decode('***')

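Consequently, as soon as the byte string contains a byte above 127, or the unicode string contains a character whose code point is above 127, a UnicodeDecodeError or UnicodeEncodeError is raised. A rule for avoiding these errors: avoid calling encode on byte strings, and avoid calling decode on unicode strings.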
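
One way to follow this rule is to route every conversion through helpers that check the type first. The sketch below is only a suggestion: `to_unicode` and `to_bytes` are hypothetical names, UTF-8 is an assumed default, and the type lookups via `type(u'')` and `type(b'')` are there so the code runs on both Python 2.7 and Python 3.

```python
text_type = type(u'')     # `unicode` on Python 2, `str` on Python 3
binary_type = type(b'')   # `str` on Python 2, `bytes` on Python 3


def to_unicode(value, encoding='utf-8'):
    """Decode byte strings; return unicode strings unchanged."""
    if isinstance(value, binary_type):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Encode unicode strings; return byte strings unchanged."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


# decode is only ever applied to bytes, encode only to unicode
assert to_unicode(b'\xe4\xb8\xad') == u'\u4e2d'
assert to_unicode(u'\u4e2d') == u'\u4e2d'
assert to_bytes(u'\u4e2d') == b'\xe4\xb8\xad'
assert to_bytes(b'\xe4\xb8\xad') == b'\xe4\xb8\xad'

# The hidden ascii step described above, written out explicitly:
try:
    u'\u4e2d'.encode('ascii')
except UnicodeEncodeError as err:
    print('code point above 127 cannot be encoded as ascii:', err.reason)
```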
