1. Tokenization
NLTK's built-in tokenizers:
from nltk.tokenize import LineTokenizer, SpaceTokenizer, TweetTokenizer
from nltk import word_tokenize
LineTokenizer splits a string into lines:

rawtext = "line..."  # placeholder raw text
ltokenizer = LineTokenizer()
print("Output:", ltokenizer.tokenize(rawtext))

SpaceTokenizer splits on space characters:

stokenizer = SpaceTokenizer()
print("Output:", stokenizer.tokenize(rawtext))
TweetTokenizer handles the special characters found in tweets (hashtags, mentions, emoticons):

ttokenizer = TweetTokenizer()
print("Output:", ttokenizer.tokenize(rawtext))
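To compare the tokenizers side by side, here is a minimal self-contained sketch; the sample string is a made-up example, and word_tokenize (imported above but not yet used) is included for contrast.

import nltk
from nltk.tokenize import LineTokenizer, SpaceTokenizer, TweetTokenizer
from nltk import word_tokenize

# nltk.download('punkt')  # word_tokenize may need the punkt model on first use

sample = "Good muffins cost $3.88\nin New York. :-) #yummy @friend"  # illustrative example

print("LineTokenizer :", LineTokenizer().tokenize(sample))   # splits on newlines
print("SpaceTokenizer:", SpaceTokenizer().tokenize(sample))  # splits on spaces only
print("TweetTokenizer:", TweetTokenizer().tokenize(sample))  # keeps :-) #yummy @friend intact
print("word_tokenize :", word_tokenize(sample))              # standard word/punctuation split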
2. Stemming
from nltk import PorterStemmer, LancasterStemmer, word_tokenize
raw = "line..."  # tokenize the raw text first
tokens = word_tokenize(raw)

porter = PorterStemmer()  # strips common suffixes
pstems = [porter.stem(t) for t in tokens]
print(pstems)

lancaster = LancasterStemmer()  # strips a larger set of suffixes (more aggressive)
lstems = [lancaster.stem(t) for t in tokens]
print(lstems)
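To see the difference between the two stemmers, a small sketch (the sentence is my own example) can print the Porter and Lancaster stems of the same tokens side by side; Lancaster typically produces shorter, more aggressively truncated stems.

from nltk import PorterStemmer, LancasterStemmer, word_tokenize

sentence = "The runners were happily running and eating dried apples"  # illustrative example
tokens = word_tokenize(sentence)

porter = PorterStemmer()
lancaster = LancasterStemmer()

for t in tokens:
    # token, Porter stem, Lancaster stem in aligned columns
    print(f"{t:10s} {porter.stem(t):10s} {lancaster.stem(t):10s}")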
3. Lemmatization (non-proper nouns have their suffixes removed or replaced; proper nouns are left unchanged)
from nltk import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]
print(lemmas)
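One detail worth noting: lemmatize() treats every token as a noun by default, so passing a part-of-speech tag often changes the result. A short sketch (the example words are my own):

import nltk
from nltk import WordNetLemmatizer

# nltk.download('wordnet')  # the lemmatizer needs the WordNet corpus

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))           # treated as a noun -> 'running'
print(lemmatizer.lemmatize("running", pos="v"))  # treated as a verb -> 'run'
print(lemmatizer.lemmatize("geese"))             # irregular noun -> 'goose'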
4. Stopwords
import nltk  # load the corpus
from nltk.corpus import gutenberg

print(gutenberg.fileids())  # check that the corpus loaded successfully

gd_words = gutenberg.words('bible-kjv.txt')  # list of all words in the text
words_filtered = [e for e in gd_words if len(e) >= 3]  # drop words shorter than 3 characters
Load the English stopword list into the stopwords variable and filter all stopwords out:

stopwords = nltk.corpus.stopwords.words('english')
words = [w for w in words_filtered if w.lower() not in stopwords]
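Membership tests against a list are linear, so over a whole Gutenberg text it is noticeably faster to turn the stopword list into a set first. The sketch below reuses the variable names from above and, as an assumed extra step, uses FreqDist to peek at the most common remaining words (corpus downloads required).

import nltk
from nltk.corpus import gutenberg

# nltk.download('gutenberg'); nltk.download('stopwords')  # one-time downloads

gd_words = gutenberg.words('bible-kjv.txt')
words_filtered = [e for e in gd_words if len(e) >= 3]

stopwords = set(nltk.corpus.stopwords.words('english'))  # set lookups are O(1)
words = [w for w in words_filtered if w.lower() not in stopwords]

fdist = nltk.FreqDist(words)
print(fdist.most_common(10))  # ten most frequent non-stopword words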
5. Edit distance
from nltk.metrics.distance import edit_distance
def my_edit_distance(str1, str2):
    # Get the string lengths and build an m*n table
    m = len(str1) + 1
    n = len(str2) + 1
    # Create the table and initialize the first row and the first column
    table = {}
    for i in range(m):
        table[i, 0] = i
    for j in range(n):
        table[0, j] = j
    # Fill in the matrix
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if str1[i-1] == str2[j-1] else 1
            table[i, j] = min(table[i, j-1] + 1,       # insertion
                              table[i-1, j] + 1,       # deletion
                              table[i-1, j-1] + cost)  # substitution
    # The final edit distance is in the bottom-right cell
    return table[m-1, n-1]
Call our function and NLTK's edit_distance() to compute the edit distance between the two strings:

print("Our algorithm :", my_edit_distance("hand", "and"))
print("NLTK algorithm:", edit_distance("hand", "and"))
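A quick sanity check, using pairs whose Levenshtein distance is well known ("kitten" to "sitting" is 3: two substitutions and one insertion), can confirm that the hand-rolled table agrees with NLTK's implementation; the word pairs and the assert are illustrative only.

from nltk.metrics.distance import edit_distance

for a, b in [("hand", "and"), ("kitten", "sitting"), ("flaw", "lawn")]:
    ours = my_edit_distance(a, b)
    nltks = edit_distance(a, b)
    print(f"{a!r} -> {b!r}: ours={ours}, nltk={nltks}")
    assert ours == nltks  # both should compute the same Levenshtein distance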