Reading UTF-8 files in Python without interference from illegal characters

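I recently worked on a regex-matching project that uses open() to open UTF-8 files and read them line by line. Some of the files contain bytes that are not valid UTF-8, so the script raised errors at runtime. While debugging, I found that whether you call read(1) (read one character) or readline() (read one line), the Python library reads and decodes a large block of data at once, and if that block contains an illegal byte anywhere, the whole call fails.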
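The block-at-a-time behaviour can be checked with a small experiment. The sketch below is only an illustration: the helper name first_char_or_error is made up for this example, and it assumes CPython, where a file of about 2 KB fits inside a single internal read. The bad byte sits after 100 valid lines, yet asking for just the first character already fails.

```python
import os
import tempfile

def first_char_or_error(data: bytes) -> str:
    """Write data to a temporary file, then try to read a single character."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        with open(path, mode='r', encoding='utf-8') as f:
            return f.read(1)
    except UnicodeDecodeError:
        # The decoder saw the bad byte while decoding the whole block.
        return 'UnicodeDecodeError'
    finally:
        os.remove(path)

# 100 valid lines, about 2 KB in total.
good = b''.join(b'this is line no. %d\n' % i for i in range(100))
# The same content followed by a stray 0xa0 byte and one more line.
bad = good + b'\xa0' + b'this is line no. 100\n'

print(first_char_or_error(good))
print(first_char_or_error(bad))
```

With the clean data the call returns the first character; with the bad byte appended far from the start, the same read(1) call raises instead.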

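For example, the following code reads and prints each line, but the line containing the illegal character, along with a number of lines before and after it, is never printed.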

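file = open('/home/hyphen/example.txt', mode='r', encoding='utf-8')
while True:
    try:
        line = file.readline()
        print(line, end='')
    except UnicodeDecodeError:
        # The whole block that failed to decode is lost, not just one line.
        continue
    if line == '':
        # readline() returns an empty string at end of file.
        break
file.close()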

The test file example.txt contains 1000 lines of the following content. Note that an illegal byte 0xa0 was inserted at the start of line 500:

this is line no. 0
...
this is line no. 488
this is line no. 489
this is line no. 490
this is line no. 491
this is line no. 492
this is line no. 493
this is line no. 494
this is line no. 495
this is line no. 496
this is line no. 497
this is line no. 498
this is line no. 499
<illegal byte 0xa0>this is line no. 500
this is line no. 501
this is line no. 502
this is line no. 503
this is line no. 504
this is line no. 505
this is line no. 506
this is line no. 507
...
this is line no. 999
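A file like this can be generated with a few lines of Python. This is a minimal sketch rather than the script used for the original test: the function name build_example and its parameters are made up for illustration, and the file is written to a temporary directory instead of /home/hyphen.

```python
import os
import tempfile

def build_example(path, total=1000, bad_line=500):
    """Write `total` numbered lines, with a stray 0xa0 byte before `bad_line`."""
    # Binary mode, so the raw byte is written exactly as given.
    with open(path, 'wb') as f:
        for i in range(total):
            if i == bad_line:
                f.write(b'\xa0')
            f.write(b'this is line no. %d\n' % i)

example_path = os.path.join(tempfile.mkdtemp(), 'example.txt')
build_example(example_path)
print(example_path, os.path.getsize(example_path))
```

Writing in binary mode matters here: in text mode the encoder would never emit a lone 0xa0 byte.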

Running the Python code produces the following output:

...
this is line no. 383
this is line no. 384
this is line no. 385
this is line no. 386
this is line no. 387
this is line no. 388
this is line no. 389
this is line no. 390
this is line no. 391
this is line no. 392
this is line no. 393
this is line no. 394
line no. 785
this is line no. 786
this is line no. 787
this is line no. 788
this is line no. 789
this is line no. 790
this is line no. 791
this is line no. 792
this is line no. 793
this is line no. 794
this is line no. 795
this is line no. 796
this is line no. 797
...

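As the output shows, everything from line 395 up to the first part of line 785 was read from the file, but because that block contained the illegal byte the call failed and none of those lines were printed. The missing span is about 8 KB, far more than the single bad line, which is consistent with the text layer decoding one large block at a time.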
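If decoding should stay strict but only the offending line should be lost, one possible approach is to read the file in binary mode and decode each line yourself. This is a sketch under that assumption, not what the original script did; the generator name read_lines_strict is hypothetical, and an in-memory io.BytesIO stands in for a file opened with mode='rb'.

```python
import io

def read_lines_strict(raw):
    """Yield (line_number, text) pairs; text is None if the line is not valid UTF-8."""
    # Iterating a binary stream splits on b'\n', which is safe for UTF-8
    # because the byte 0x0a never occurs inside a multi-byte sequence.
    for number, raw_line in enumerate(raw):
        try:
            yield number, raw_line.decode('utf-8')
        except UnicodeDecodeError:
            yield number, None

sample = io.BytesIO(
    b'this is line no. 0\n'
    b'\xa0this is line no. 1\n'
    b'this is line no. 2\n'
)
for number, text in read_lines_strict(sample):
    print(number, repr(text))
```

Because each line is decoded separately, a bad byte affects only the line it sits on, and the neighbouring lines still come through.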

To skip these illegal characters and still read the rest of the line normally, pass errors='ignore' when opening the file.

file = open('/home/hyphen/example.txt', mode='r', encoding='utf-8', errors='ignore')

Running the code again now reads the file normally:

this is line no. 484
this is line no. 485
this is line no. 486
this is line no. 487
this is line no. 488
this is line no. 489
this is line no. 490
this is line no. 491
this is line no. 492
this is line no. 493
this is line no. 494
this is line no. 495
this is line no. 496
this is line no. 497
this is line no. 498
this is line no. 499
this is line no. 500
this is line no. 501
this is line no. 502
this is line no. 503
this is line no. 504
this is line no. 505
this is line no. 506
this is line no. 507
this is line no. 508
this is line no. 509
this is line no. 510
this is line no. 511
this is line no. 512
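Note that errors='ignore' drops the bad bytes silently. For comparison, the short sketch below decodes the problem line with three of the standard error handlers; 'replace' and 'backslashreplace' are not used in the script above and are shown only as alternatives that leave a visible trace of the bad byte. The same handler names can be passed as the errors argument of open().

```python
raw = b'\xa0this is line no. 500\n'

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode('utf-8', errors='ignore')
# 'replace' substitutes U+FFFD, the Unicode replacement character.
replaced = raw.decode('utf-8', errors='replace')
# 'backslashreplace' keeps the byte value as a literal escape sequence.
escaped = raw.decode('utf-8', errors='backslashreplace')

print(repr(ignored))
print(repr(replaced))
print(repr(escaped))
```

Which handler fits depends on whether the downstream code needs to know that something was removed.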
