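I recently worked on a regex-matching project that uses open() to open UTF-8 files and read them line by line. Some of the files contain bytes that are not valid UTF-8, so the script raised errors. While debugging I found that whether you call read(1) (which reads a single character in text mode) or readline() (which reads one line), the Python library actually reads and decodes a large chunk of the file at once, and if that chunk contains an invalid byte anywhere, the whole call fails.
For example, the following code reads and prints every line, yet the line containing the invalid byte and many of the lines before and after it are never printed.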
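# text mode with the default strict error handling;
# the platform default encoding is assumed to be UTF-8
file = open('/home/hyphen/example.txt', mode='r')
while True:
    try:
        line = file.readline()
        print(line, end='')
    except UnicodeDecodeError:
        # the chunk being decoded contained an invalid byte
        continue
    if line == '':
        # readline() returns '' only at end of file
        break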
The test file example.txt contains the following 1000 lines. Note the invalid byte 0xa0 inserted at the start of line no. 500:
this is line no. 0
...
this is line no. 488
this is line no. 489
this is line no. 490
this is line no. 491
this is line no. 492
this is line no. 493
this is line no. 494
this is line no. 495
this is line no. 496
this is line no. 497
this is line no. 498
this is line no. 499
<invalid byte 0xa0>this is line no. 500
this is line no. 501
this is line no. 502
this is line no. 503
this is line no. 504
this is line no. 505
this is line no. 506
this is line no. 507
...
this is line no. 999
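A test file like this could be generated with a short script along the following lines. This is a sketch, not the author's actual script: the function name write_example and the output location are assumptions, and the file is written in binary mode so that the lone 0xa0 byte lands in the file unchanged.

```python
import os
import tempfile


def write_example(path, total=1000, bad_line=500):
    """Write `total` numbered lines, with a stray 0xa0 byte before `bad_line`.

    Hypothetical helper for illustration; not part of the original post.
    """
    with open(path, 'wb') as f:  # binary mode: bytes are written verbatim
        for i in range(total):
            if i == bad_line:
                # 0xa0 is a UTF-8 continuation byte, so it is invalid
                # when it appears where a character should start
                f.write(b'\xa0')
            f.write(b'this is line no. %d\n' % i)


if __name__ == '__main__':
    # assumed location; any writable path works
    target = os.path.join(tempfile.gettempdir(), 'example.txt')
    write_example(target)
    print('wrote', target, os.path.getsize(target), 'bytes')
```

Writing bytes rather than text matters here: a text-mode write would encode every character as valid UTF-8, so the broken byte could never be produced that way.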
Running the Python code produces this output:
...
this is line no. 383
this is line no. 384
this is line no. 385
this is line no. 386
this is line no. 387
this is line no. 388
this is line no. 389
this is line no. 390
this is line no. 391
this is line no. 392
this is line no. 393
this is line no. 394
line no. 785
this is line no. 786
this is line no. 787
this is line no. 788
this is line no. 789
this is line no. 790
this is line no. 791
this is line no. 792
this is line no. 793
this is line no. 794
this is line no. 795
this is line no. 796
this is line no. 797
...
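As you can see, everything from line 395 up to the middle of line 785, that is, the stretch surrounding line 500, was actually read in one go. Because that chunk contained the invalid byte, the call failed and none of those lines were printed.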
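The effect can be reproduced without touching the disk. The sketch below is an illustration under assumptions, not the original test: it builds the same bytes in memory, wraps them in io.TextIOWrapper (the text layer that open() returns in text mode), and runs the same kind of readline() loop. The helper name surviving_lines is made up, and how many lines go missing depends on the interpreter's internal chunk size, which is an implementation detail.

```python
import io


def surviving_lines(data):
    """Run a readline() loop over `data` and return the lines that came back.

    Hypothetical helper; calls that raise UnicodeDecodeError are skipped.
    """
    lines = []
    # errors defaults to strict, just as with a plain open() in text mode
    reader = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8')
    while True:
        try:
            line = reader.readline()
        except UnicodeDecodeError:
            # the whole chunk being decoded is lost, not just one line
            continue
        if line == '':
            break
        lines.append(line)
    return lines


if __name__ == '__main__':
    # same layout as example.txt: 1000 lines, 0xa0 before line no. 500
    data = b''.join(
        (b'\xa0' if i == 500 else b'') + b'this is line no. %d\n' % i
        for i in range(1000)
    )
    lines = surviving_lines(data)
    print('lines returned:', len(lines), 'of 1000')
    print('line no. 499 returned:', 'this is line no. 499\n' in lines)
```

Only one line contains a bad byte, so any shortfall beyond that single line comes from the chunked decoding described above.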
If you want to ignore the invalid bytes and still read the rest of the affected line normally, pass errors='ignore' when opening the file.
file = open('/home/hyphen/example.txt', mode='r', errors='ignore')
Running the code again now reads the file normally:
this is line no. 484
this is line no. 485
this is line no. 486
this is line no. 487
this is line no. 488
this is line no. 489
this is line no. 490
this is line no. 491
this is line no. 492
this is line no. 493
this is line no. 494
this is line no. 495
this is line no. 496
this is line no. 497
this is line no. 498
this is line no. 499
this is line no. 500
this is line no. 501
this is line no. 502
this is line no. 503
this is line no. 504
this is line no. 505
this is line no. 506
this is line no. 507
this is line no. 508
this is line no. 509
this is line no. 510
this is line no. 511
this is line no. 512
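To compare error handlers side by side, here is a small in-memory sketch (again an illustration with a made-up helper name, read_all_lines, not code from the original project). errors='ignore' silently drops the bad byte, while errors='replace', another standard handler, substitutes U+FFFD so the damage stays visible.

```python
import io


def read_all_lines(data, errors):
    """Decode `data` as UTF-8 with the given error handler; return all lines.

    Hypothetical helper for illustration.
    """
    reader = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8', errors=errors)
    return reader.readlines()


if __name__ == '__main__':
    # same layout as example.txt: 1000 lines, 0xa0 before line no. 500
    data = b''.join(
        (b'\xa0' if i == 500 else b'') + b'this is line no. %d\n' % i
        for i in range(1000)
    )
    ignored = read_all_lines(data, 'ignore')
    replaced = read_all_lines(data, 'replace')
    print(len(ignored), repr(ignored[500]))
    print(len(replaced), repr(replaced[500]))
```

Both handlers return all 1000 lines. With 'ignore', line no. 500 comes back as if the stray byte had never been there. With 'replace' it starts with the U+FFFD replacement character, which can be useful when you want to find out later which lines were damaged.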