Detecting Whether a Byte Stream Is UTF-8 Encoded

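The UTF-8 encoding rules can be summarized as follows:

ASCII characters (U+0000 - U+007F) are stored unchanged, one byte each, with the high bit set to 0.

All other characters are encoded as follows: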

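• The first byte, written in binary, consists of n 1 bits followed by a single 0 (n >= 2, and at most 4 under RFC 3629). The bits after that 0 store part of the actual character code, and n is the total number of bytes in the multi-byte sequence (including the first byte).

• It is followed by n - 1 bytes that each start with 10; the low 6 bits of each store part of the actual character code.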

So by checking every byte of the stream against these patterns, we can judge whether the stream is UTF-8 encoded.
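
To make the bit patterns concrete, here is a minimal sketch that encodes a single code point by hand, following the rules above. The class name Utf8RulesSketch and the method EncodeCodePoint are hypothetical names chosen for illustration; they are not part of the original post or of any library.

```csharp
using System;

public static class Utf8RulesSketch
{
    // Hypothetical helper (not from the original post): encodes one Unicode
    // code point into UTF-8 bytes by applying the rules listed above.
    public static byte[] EncodeCodePoint(int codePoint)
    {
        // Reject values outside Unicode and the surrogate range U+D800-U+DFFF.
        if (codePoint < 0 || codePoint > 0x10FFFF ||
            (codePoint >= 0xD800 && codePoint <= 0xDFFF))
        {
            throw new ArgumentOutOfRangeException(nameof(codePoint));
        }

        if (codePoint <= 0x7F)
        {
            // ASCII: a single byte of the form 0xxxxxxx.
            return new byte[] { (byte)codePoint };
        }

        if (codePoint <= 0x7FF)
        {
            // n = 2: 110xxxxx 10xxxxxx
            return new byte[]
            {
                (byte)(0xC0 | (codePoint >> 6)),
                (byte)(0x80 | (codePoint & 0x3F)),
            };
        }

        if (codePoint <= 0xFFFF)
        {
            // n = 3: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[]
            {
                (byte)(0xE0 | (codePoint >> 12)),
                (byte)(0x80 | ((codePoint >> 6) & 0x3F)),
                (byte)(0x80 | (codePoint & 0x3F)),
            };
        }

        // n = 4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[]
        {
            (byte)(0xF0 | (codePoint >> 18)),
            (byte)(0x80 | ((codePoint >> 12) & 0x3F)),
            (byte)(0x80 | ((codePoint >> 6) & 0x3F)),
            (byte)(0x80 | (codePoint & 0x3F)),
        };
    }
}
```

For example, Utf8RulesSketch.EncodeCodePoint(0x4E2D) (the character 中) gives the three bytes E4 B8 AD: one lead byte starting with 1110 and two following bytes starting with 10.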

Based on these rules, here is my C# code:

// Checks only the bit structure of the stream; overlong forms and
// surrogate code points are not rejected.
public static bool IsTextUTF8(byte[] inputStream)
{
    // Number of following (10xxxxxx) bytes still expected.
    int encodingBytesCount = 0;
    bool allTextsAreASCIIChars = true;

    for (int i = 0; i < inputStream.Length; i++)
    {
        byte current = inputStream[i];

        if ((current & 0x80) == 0x80)
        {
            allTextsAreASCIIChars = false;
        }

        // First byte
        if (encodingBytesCount == 0)
        {
            if ((current & 0x80) == 0)
            {
                // ASCII chars, from 0x00-0x7F
                continue;
            }

            if ((current & 0xC0) == 0xC0)
            {
                encodingBytesCount = 1;
                current <<= 2;

                // More than two bytes are used to encode this Unicode char.
                // Calculate the real length.
                while ((current & 0x80) == 0x80)
                {
                    current <<= 1;
                    encodingBytesCount++;
                }

                // RFC 3629 allows at most 4 bytes per character,
                // i.e. at most 3 following bytes.
                if (encodingBytesCount > 3)
                {
                    return false;
                }
            }
            else
            {
                // Invalid bit structure for the UTF-8 encoding rule.
                return false;
            }
        }
        else
        {
            // Following bytes must start with 10.
            if ((current & 0xC0) == 0x80)
            {
                encodingBytesCount--;
            }
            else
            {
                // Invalid bit structure for the UTF-8 encoding rule.
                return false;
            }
        }
    }

    if (encodingBytesCount != 0)
    {
        // Invalid bit structure: wrong count of following bytes.
        return false;
    }

    // Although UTF-8 can encode ASCII chars, an input stream whose contents
    // are all ASCII is treated as being in the default encoding.
    return !allTextsAreASCIIChars;
}
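
As a cross-check for a hand-written scanner like the one above, the .NET decoder can be asked to throw on invalid input. This is a different technique from the author's method, shown only as a sketch: the class name Utf8StrictDecodeSketch and the method IsStrictlyValidUtf8 are hypothetical, while UTF8Encoding(bool, bool), GetCharCount and DecoderFallbackException are existing .NET APIs.

```csharp
using System;
using System.Text;

public static class Utf8StrictDecodeSketch
{
    // First argument: do not emit a BOM when encoding.
    // Second argument: throw on invalid bytes instead of substituting U+FFFD.
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    // Hypothetical helper: returns true when the framework decoder accepts
    // the whole array as well-formed UTF-8.
    public static bool IsStrictlyValidUtf8(byte[] bytes)
    {
        try
        {
            // Counting the characters forces a full validation pass
            // without building the decoded string.
            Strict.GetCharCount(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

Note one deliberate difference: this sketch accepts pure ASCII input as valid UTF-8, whereas the author's function returns false for it.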

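Note: if the goal is to decide whether a file is UTF-8 encoded, this method is not the only option, because files saved as UTF-8 often begin with a BOM, the three bytes EF BB BF, which marks the file as UTF-8. The BOM is optional, however, so a file without one may still be UTF-8.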
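
A BOM check of this kind could look like the sketch below. The UTF-8 BOM is the three-byte sequence EF BB BF; the class name Utf8BomSketch and the method HasUtf8Bom are hypothetical names used for illustration.

```csharp
public static class Utf8BomSketch
{
    // Hypothetical helper: returns true when the data begins with the
    // UTF-8 byte order mark EF BB BF.
    public static bool HasUtf8Bom(byte[] data)
    {
        return data != null
            && data.Length >= 3
            && data[0] == 0xEF
            && data[1] == 0xBB
            && data[2] == 0xBF;
    }
}
```

A true result is strong evidence of UTF-8, but a false result proves nothing, since many UTF-8 files are saved without a BOM; in that case the byte-by-byte check is still needed.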
