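1. Tokenization
NLTK's built-in tokenizers

from nltk.tokenize import LineTokenizer, SpaceTokenizer, TweetTokenizer
from nltk import word_tokenize

# LineTokenizer: splits a string into lines
ltokenizer = LineTokenizer()
print("output:", ltokenizer.tokenize(""))  # sample text omitted in the source

# SpaceTokenizer: splits on the space character
rawtext = "line…"
stokenizer = SpaceTokenizer()
print("output:", stokenizer.tokenize(rawtext))

# TweetTokenizer: handles special characters
ttokenizer = TweetTokenizer()
print("output:", ttokenizer.tokenize(""))  # sample text omitted in the source

2. Stemming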
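from nltk import PorterStemmer, LancasterStemmer, word_tokenize

raw = "line…"
# Tokenize (word_tokenize needs NLTK's Punkt tokenizer data, fetched with nltk.download)
tokens = word_tokenize(raw)

# Porter stemmer: strips suffixes
porter = PorterStemmer()
pstems = [porter.stem(t) for t in tokens]
print(pstems)

# Lancaster stemmer: strips a wider range of suffixes
lancaster = LancasterStemmer()
lstems = [lancaster.stem(t) for t in tokens]
print(lstems)

3. Lemmatization (for words that are not proper nouns the suffix is removed or replaced; proper nouns are left unchanged)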
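The contrast between stemming and lemmatization can be sketched without NLTK. A stemmer strips suffixes by rule and may return a non-word, while a lemmatizer looks the word up and returns a dictionary form, passing unknown words such as proper nouns through unchanged. Everything named here (SUFFIXES, toy_stem, TOY_LEMMAS, toy_lemmatize) is hypothetical and made up for illustration; the real Porter, Lancaster and WordNet implementations are far more elaborate.

```python
# Toy illustration only: the suffix rules and the lemma dictionary are invented,
# and do not reproduce the behaviour of NLTK's stemmers or WordNetLemmatizer.
SUFFIXES = ("ing", "ies", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


# Hypothetical mini-dictionary standing in for a real lexical database.
TOY_LEMMAS = {"studies": "study", "mice": "mouse", "running": "run"}


def toy_lemmatize(word):
    """Look the word up; words not in the dictionary are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


for w in ["studies", "mice", "running", "London"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stems "stud" and "runn" are not words, whereas the lookups give "study" and "run", and "London" comes back untouched from the lookup.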
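from nltk import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]
print(lemmas)

4. Stop words
import nltk
# Load the corpus
from nltk.corpus import gutenberg
print(gutenberg.fileids())  # check that it loaded successfully

# Copy the list of all words in the text file
gd_words = gutenberg.words('bible-kjv.txt')

# Iterate over the words and drop those shorter than 3 characters
words_filtered = [e for e in gd_words if len(e) >= 3]

# Load the English stop words into the stopwords variable, then filter out all stop words
stopwords = nltk.corpus.stopwords.words('english')
words = [w for w in words_filtered if w.lower() not in stopwords]

5. Edit distance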
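Edit distance (Levenshtein distance) is the smallest number of single-character insertions, deletions and substitutions that turns one string into another. As a reference, the usual table recurrence can be written as below. The symbols D, s, t and c are notation introduced here and do not come from the original notes: D(i, j) is the distance between the first i characters of s and the first j characters of t, and the answer for strings of length m and n is D(m, n).

```latex
\begin{aligned}
D(i, 0) &= i, \qquad D(0, j) = j, \\
D(i, j) &= \min\bigl( D(i, j-1) + 1,\; D(i-1, j) + 1,\; D(i-1, j-1) + c_{ij} \bigr), \\
c_{ij} &= \begin{cases} 0 & \text{if } s_i = t_j \\ 1 & \text{otherwise} \end{cases}
\end{aligned}
```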
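from nltk.metrics.distance import edit_distance

def my_edit_distance(str1, str2):
    # Get the string lengths; the table has m * n cells
    m = len(str1) + 1
    n = len(str2) + 1
    # Create the table and initialise its first column and first row
    table = {}
    for i in range(m):
        table[i, 0] = i
    for j in range(n):
        table[0, j] = j
    # Fill in the rest of the matrix
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if str1[i - 1] == str2[j - 1] else 1
            table[i, j] = min(table[i, j - 1] + 1,
                              table[i - 1, j] + 1,
                              table[i - 1, j - 1] + cost)
    # The final edit distance is the bottom-right cell
    # (indexed explicitly so that empty strings also work)
    return table[m - 1, n - 1]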
Call this function and the edit_distance() function from the NLTK package to compute the edit distance between the two strings:
print("our algorithm :", my_edit_distance("hand", "and"))
print("nltk algorithm :", edit_distance("hand", "and"))
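To see how the table is filled for this pair of strings, here is a minimal sketch that keeps the whole matrix as a list of lists and prints it row by row. The name levenshtein_table is an assumption made for this example; it is not part of NLTK and not from the original notes.

```python
def levenshtein_table(a, b):
    """Return the full (len(a)+1) x (len(b)+1) edit-distance table as a list of lists."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    # First column: delete i characters; first row: insert j characters.
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j
    # Each remaining cell takes the cheapest of insertion, deletion, substitution.
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(table[i][j - 1] + 1,
                              table[i - 1][j] + 1,
                              table[i - 1][j - 1] + cost)
    return table


table = levenshtein_table("hand", "and")
for row in table:
    print(row)
print("distance:", table[-1][-1])
```

For "hand" and "and" the bottom-right cell is 1, since deleting the leading "h" is the only edit needed.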