Some Methods for NLP Text Preprocessing

Foreword

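With the rise of techniques such as BERT, preprocessing steps like Chinese word segmentation, stop-word filtering, lemmatization, stemming, and punctuation handling have become less important in text competitions. Seen from another angle, though, these preprocessing methods reduce noise in the input, which can make a neural network more robust. So treat the following as background knowledge; whether you need any of it in your own work is up to your judgment.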

Preprocessing Methods

### Lemmatization
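from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Assumes the NLTK tokenizer, POS tagger, and WordNet data
# have already been downloaded with nltk.download().

# Map a Penn Treebank POS tag to the matching WordNet POS constant
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

sentence = 'football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.'
tokens = word_tokenize(sentence)  # tokenize the sentence
tagged_sent = pos_tag(tokens)     # get the POS tag of each token
wnl = WordNetLemmatizer()
lemmas_sent = []
for tag in tagged_sent:
    # Fall back to noun when the tag has no WordNet equivalent
    wordnet_pos = get_wordnet_pos(tag[1]) or wordnet.NOUN
    lemmas_sent.append(wnl.lemmatize(tag[0], pos=wordnet_pos))  # lemmatize
print(lemmas_sent)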

The output is ['football', 'be', 'a', 'family', 'of', 'team', 'sport', 'that', 'involve', ',', 'to', 'vary', 'degree', ',', 'kick', 'a', 'ball', 'to', 'score', 'a', 'goal', '.']
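
Two of the other steps mentioned in the foreword, punctuation handling and stop-word filtering, can be sketched on top of the lemmatized tokens using only the standard library. This is a minimal illustration and not part of the original post: the function names `remove_punctuation` and `remove_stop_words` and the tiny `STOP_WORDS` set are hypothetical, and a real project would normally load a much larger stop-word list.

```python
import string

# Hypothetical, hand-picked stop-word set for illustration only;
# real projects usually load a far larger list.
STOP_WORDS = {"a", "an", "the", "be", "is", "of", "to", "that"}

def remove_punctuation(tokens):
    """Drop tokens made up entirely of ASCII punctuation characters."""
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form appears in the stop-word set."""
    return [t for t in tokens if t.lower() not in stop_words]

# The lemmatized tokens from the example above
lemmas = ['football', 'be', 'a', 'family', 'of', 'team', 'sport', 'that',
          'involve', ',', 'to', 'vary', 'degree', ',', 'kick', 'a', 'ball',
          'to', 'score', 'a', 'goal', '.']

cleaned = remove_stop_words(remove_punctuation(lemmas))
print(cleaned)
# ['football', 'family', 'team', 'sport', 'involve', 'vary', 'degree',
#  'kick', 'ball', 'score', 'goal']
```

Filtering after lemmatization keeps the stop-word set small, since inflected forms such as "is" have already been reduced to "be" by that point.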
