NLP Study (3): Lexicons

import nltk

def unusual_words(text):
    # Return the uncommon words in the text
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    # Words that appear in the text but not in the English word list
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)

print(unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt')))
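The same set-difference idea can be tried without downloading any corpus. The sketch below is only an illustration: the function name `words_outside_vocab` and the toy token and vocabulary lists are made up for this example and are not part of NLTK.

```python
def words_outside_vocab(tokens, known_vocab):
    """Return the sorted alphabetic tokens that do not occur in known_vocab."""
    # Lowercase everything so that "The" and "the" count as the same word.
    text_vocab = {t.lower() for t in tokens if t.isalpha()}
    known = {w.lower() for w in known_vocab}
    return sorted(text_vocab - known)

# Hypothetical toy data standing in for a corpus and an English word list.
tokens = ["The", "quick", "brown", "fox", "jumpd", "over", "teh", "dog", "."]
known_vocab = ["the", "quick", "brown", "fox", "jumped", "over", "dog"]

result = words_outside_vocab(tokens, known_vocab)
print(result)  # ['jumpd', 'teh']
```

Punctuation is dropped by `isalpha()`, so only the two misspelled words remain.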

import nltk
from nltk.corpus import stopwords  # stopwords: high-frequency words with little meaning, usually filtered out

def content_fraction(text):
    # Return the fraction of the text made up of meaningful (non-stopword) words
    stop_set = set(stopwords.words('english'))
    # Words that are not stopwords, i.e. the meaningful words
    content = [w for w in text if w.lower() not in stop_set]
    print(content)  # the full list; very long for a large corpus
    return len(content) / len(text)

print(content_fraction(nltk.corpus.reuters.words()))
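To see what the fraction means on a small scale, here is a minimal sketch that takes the stopword list as a parameter. The function name `toy_content_fraction` and the four-word stopword list are assumptions for illustration only; the real NLTK English list is much longer.

```python
def toy_content_fraction(tokens, stop_words):
    """Return the share of tokens that are not in stop_words."""
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    stop_set = {w.lower() for w in stop_words}
    content = [t for t in tokens if t.lower() not in stop_set]
    return len(content) / len(tokens)

# Hypothetical, deliberately tiny stopword list.
TOY_STOPWORDS = ["the", "is", "on", "a"]
tokens = ["The", "cat", "is", "on", "a", "mat"]

fraction = toy_content_fraction(tokens, TOY_STOPWORDS)
print(round(fraction, 3))  # 0.333
```

Only "cat" and "mat" survive the filter, so two of the six tokens count as content.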

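import nltk

# A word puzzle: build words from a fixed set of letters
puzzle_letters = nltk.FreqDist('egivrvonl')
obligatory = 'r'
wordlist = nltk.corpus.words.words()
# The FreqDist comparison limits both which letters may appear and how often each may be used
print([w for w in wordlist
       if len(w) >= 6
       and obligatory in w
       and nltk.FreqDist(w) <= puzzle_letters])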
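The comparison `nltk.FreqDist(w) <= puzzle_letters` is the step that is easiest to misread. The sketch below spells out the same check with `collections.Counter` from the standard library; the helper names `can_spell` and `puzzle_matches` and the candidate list are hypothetical.

```python
from collections import Counter

def can_spell(word, letters):
    """True if word uses no letter more often than it occurs in letters."""
    available = Counter(letters)
    needed = Counter(word)
    # A missing letter has count 0 in a Counter, so it fails the comparison.
    return all(count <= available[ch] for ch, count in needed.items())

def puzzle_matches(words, letters, obligatory, min_len=6):
    """Keep words that are long enough, contain the obligatory letter,
    and can be spelled from the available letters."""
    return [w for w in words
            if len(w) >= min_len and obligatory in w and can_spell(w, letters)]

# Hypothetical candidates; the original code scans the whole NLTK word list.
candidates = ["glover", "govern", "revere", "love", "violin"]
matches = puzzle_matches(candidates, "egivrvonl", "r")
print(matches)  # ['glover', 'govern']
```

"revere" is rejected because it needs more than one "r" and "e", "love" is too short, and "violin" needs two "i" and lacks the obligatory "r".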

import nltk

names = nltk.corpus.names  # corpus of personal names
print(names.fileids())
male_names = names.words('male.txt')      # male names
female_names = names.words('female.txt')  # female names
# Names used for both men and women (the overlap of the two lists)
print([w for w in male_names if w in female_names])
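Testing `w in female_names` against a list rescans that list for every male name. A possible variant, sketched here with made-up name lists and a hypothetical helper `shared_names`, converts one side to a set first so each lookup is cheap while the original order is kept.

```python
def shared_names(first, second):
    """Return the items of first that also occur in second, in first's order."""
    second_set = set(second)  # set membership is fast compared with a list scan
    return [name for name in first if name in second_set]

# Hypothetical toy lists standing in for male.txt and female.txt.
male = ["Alex", "Chris", "John", "Robin"]
female = ["Alex", "Chris", "Mary", "Robin"]

both = shared_names(male, female)
print(both)  # ['Alex', 'Chris', 'Robin']
```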

import nltk
from nltk.corpus import wordnet as wn

names = nltk.corpus.names  # corpus of personal names
# Form multiple phoneme groups
# Output a word with its phonemes in between; digits mark stress
# Look up synsets: [Synset('car.n.01')] is the first sense of the word "car"
print(wn.synset('car.n.01').lemma_names())  # synonyms grouped in this synset
print(wn.synset('car.n.01').definition())   # definition of this sense
print(wn.synset('car.n.01').examples())     # usage examples
print(wn.synset('car.n.01').lemmas())       # all lemmas of the synset
print(wn.lemma('car.n.01.automobile'))           # look up one specific lemma
print(wn.lemma('car.n.01.automobile').synset())  # the synset that lemma belongs to
print(wn.lemma('car.n.01.automobile').name())    # the name of that lemma

# Visit every synset that contains "car" and print its lemma names
for synset in wn.synsets('car'):
    print(synset.lemma_names())

motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()  # hyponym synsets (more specific concepts)
# Sort the lemma names of the hyponyms
print(sorted(lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas()))
print(motorcar.hypernyms())  # hypernyms (more general concepts)
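Hypernyms and hyponyms form a hierarchy that can be walked upward from a specific term to more general ones. The sketch below shows that walk on a simplified, hand-written mapping; `TOY_HYPERNYM` and `hypernym_chain` are hypothetical names, and the hierarchy is not WordNet's actual one.

```python
# Hypothetical child -> parent mapping; assumed to contain no cycles.
TOY_HYPERNYM = {
    "ambulance": "car",
    "car": "motor vehicle",
    "motor vehicle": "vehicle",
}

def hypernym_chain(term, parent_of):
    """Follow parent links from term until a term with no parent is reached."""
    chain = [term]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain

chain = hypernym_chain("ambulance", TOY_HYPERNYM)
print(chain)  # ['ambulance', 'car', 'motor vehicle', 'vehicle']
```

With WordNet itself, repeatedly calling `hypernyms()` on a synset plays the role of the dictionary lookup here, with the difference that a synset may have more than one hypernym.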
