Removing invisible characters such as \u200b in Python

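While scraping, strings that contained invisible characters could not be inserted into the database. The insert failed with mysql.connector.errors.DatabaseError: 1267 (HY000): Illegal mix of collations (gbk_chinese_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='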

The invisible characters I ran into were:

\u200b \ufeff

\ue601

There are other invisible characters as well. repr(s) shows them in their escaped form:

s = chr(8204) + chr(8205)
print(s)
print(repr(s))
# '\u200c\u200d'
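To see which hidden characters a scraped string actually contains, it can help to list them with their position and Unicode category. The following is a minimal sketch using the standard unicodedata module. The helper name describe_hidden_chars and the '<unnamed>' placeholder are made up for illustration and are not part of the original post.

```python
import unicodedata


def describe_hidden_chars(s):
    """Return (index, escaped form, category, name) for each non-printable character.

    Hypothetical helper for illustration only.
    """
    return [
        (
            i,
            ch.encode('unicode_escape').decode('ascii'),  # e.g. '\\u200b'
            unicodedata.category(ch),                     # e.g. 'Cf' (format character)
            unicodedata.name(ch, '<unnamed>'),            # private-use characters have no name
        )
        for i, ch in enumerate(s)
        if not ch.isprintable()
    ]


for entry in describe_hidden_chars('a\u200bb\ufeff\ue601'):
    print(entry)
```

The category column is useful later on: 'Cf' marks format characters such as \u200b, and 'Co' marks private-use characters such as \ue601.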

str.isprintable() is useful here: it returns True if every character in the string is printable or the string is empty, and False otherwise.

print(''.isprintable())
# True
print('a'.isprintable())
# True
print('a\u2029'.isprintable())
# False

Removing all invisible characters

def remove_unprintable_chars(s):
    """Remove all non-printable characters."""
    return ''.join(x for x in s if x.isprintable())

s = 'a\u2029b'
print(s.isprintable(), s)
# False
s = remove_unprintable_chars(s)
print(s.isprintable(), s)
# True
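One caveat: str.isprintable() also treats newlines, tabs and non-ASCII spaces as non-printable, so filtering on it alone flattens multi-line text. Below is a hedged sketch of a variant that keeps a whitelist of whitespace. The function name and the keep parameter are assumptions for illustration, not something from the original post.

```python
def remove_unprintable_keep_whitespace(s, keep='\n\r\t'):
    """Drop non-printable characters except those listed in `keep`.

    Hypothetical variant for illustration only.
    """
    return ''.join(ch for ch in s if ch.isprintable() or ch in keep)


text = 'line 1\u200b\n\tline 2\ufeff'
print(repr(remove_unprintable_keep_whitespace(text)))
```

Pass keep='' to get the same behaviour as filtering on isprintable() alone.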

An alternative is to replace a specific character directly:

s = s.replace('\u200b', '')
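When several known characters need to go, chaining replace() calls gets repetitive. A possible sketch using str.translate with a deletion table follows. The character list and the names HIDDEN_CHARS, DELETE_TABLE and strip_listed_chars are illustrative assumptions and should be adjusted to the characters actually seen in the data.

```python
# Characters to delete (an example list, not exhaustive).
HIDDEN_CHARS = '\u200b\u200c\u200d\ufeff'

# str.translate deletes every code point that maps to None.
DELETE_TABLE = dict.fromkeys(map(ord, HIDDEN_CHARS))


def strip_listed_chars(s):
    """Delete only the characters listed in HIDDEN_CHARS; leave everything else."""
    return s.translate(DELETE_TABLE)


print(repr(strip_listed_chars('\ufeffa\u200bb\u200c\u200dc')))
```

Unlike the isprintable() filter, this leaves newlines and tabs untouched because only the listed characters are removed.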

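In addition, the references remove control characters with the method below. In my tests it had no effect on the characters above, which is expected: it only covers U+0000 to U+001F and U+007F to U+009F, and \u200b, \ufeff and \ue601 all lie outside those ranges.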

import re
import itertools

def remove_control_chars(s):
    control_chars = ''.join(map(chr, itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0))))
    control_char_re = re.compile('[%s]' % re.escape(control_chars))
    return control_char_re.sub('', s)
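The regex above only matches category Cc (control) characters, while \u200b and \ufeff are category Cf (format) and \ue601 is Co (private use). A sketch that filters on the Unicode general category instead, assuming every "Other" category (those starting with 'C') should be dropped; the function name is hypothetical:

```python
import unicodedata


def remove_other_category_chars(s):
    """Drop characters whose Unicode general category starts with 'C'
    (Cc control, Cf format, Cs surrogate, Co private use, Cn unassigned).

    Hypothetical helper for illustration only.
    """
    return ''.join(ch for ch in s if unicodedata.category(ch)[0] != 'C')


print(unicodedata.category('\x00'), unicodedata.category('\u200b'), unicodedata.category('\ue601'))
print(repr(remove_other_category_chars('a\u200bb\x00c\ue601')))
```

Note that \n and \t are also category Cc, so this removes them too. Unlike isprintable(), it keeps separators such as \u2029 and non-ASCII spaces.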

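References:

Python - ******json encoding issue: illegal character

Removing invisible control characters in Python

Stripping non printable characters from a string in Python

unicodedata: Unicode Database

Batch-replacing string content in Python

Fixing the MySQL character set error ERROR 1267 (HY000)

Removing \ufeff

How to remove all invisible characters?

str.isprintable(): Python documentation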
