NLTK (Tagging Words)

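nltk.word_tokenize(text): tokenizes the given sentence and returns a list of words.

nltk.pos_tag(words): tags each word in the given list with its part of speech and returns a list of (word, tag) tuples.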

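import nltk

# Assumes the tokenizer and tagger data have already been fetched with nltk.download()
words = nltk.word_tokenize('and now for something completely different')  # split the sentence into tokens
print(words)
word_tag = nltk.pos_tag(words)  # attach a part-of-speech tag to each token
print(word_tag)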

['and', 'now', 'for', 'something', 'completely', 'different']
[('and', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
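
Because pos_tag returns an ordinary list of (word, tag) tuples, plain Python is enough to work with the result. The sketch below hard-codes the tagged pairs from the example above (with tags in the uppercase form NLTK emits) instead of calling NLTK, so it runs without any downloaded data. The helper name words_with_tag is hypothetical and not part of NLTK.

```python
# Tagged pairs copied from the example above, hard-coded so the snippet runs without NLTK data.
word_tag = [('and', 'CC'), ('now', 'RB'), ('for', 'IN'),
            ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]


def words_with_tag(tagged, tag):
    """Return the words whose tag equals `tag` (hypothetical helper, not part of NLTK)."""
    return [word for word, t in tagged if t == tag]


# Collect the adverbs (Penn Treebank tag RB) from the tagged sentence.
adverbs = words_with_tag(word_tag, 'RB')
print(adverbs)
```

The same filter works on any list in this shape, including the corpus output shown later.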

str2tuple(s, sep='/')
Given the string representation of a tagged token, return the corresponding tuple representation. The rightmost occurrence of *sep* in *s* will be used to divide *s* into a word string and a tag string. If *sep* does not occur in *s*, return (s, None).

>>> from nltk.tag.util import str2tuple
>>> str2tuple('fly/NN')
('fly', 'NN')

:type s: str
:param s: The string representation of a tagged token.
:type sep: str
:param sep: The separator string used to separate word strings from tags.

The tag is converted to uppercase.

The default separator is sep='/'.

t = nltk.str2tuple('fly~abc', sep='~')
t
Out[26]: ('fly', 'ABC')

t = nltk.str2tuple('fly/abc')
t
Out[28]: ('fly', 'ABC')
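
The rules described above (split at the rightmost separator, uppercase the tag, return None as the tag when the separator is missing) can be sketched in plain Python. This is an illustrative re-implementation under those stated rules, not NLTK's source code, and the name split_tagged_token is hypothetical.

```python
def split_tagged_token(s, sep='/'):
    """Split 'word<sep>tag' into (word, TAG); hypothetical stand-in for str2tuple."""
    loc = s.rfind(sep)          # index of the rightmost separator, -1 if absent
    if loc < 0:
        return (s, None)        # no separator: the whole string is the word
    return (s[:loc], s[loc + len(sep):].upper())  # the tag part is uppercased


print(split_tagged_token('fly/nn'))
print(split_tagged_token('fly~abc', sep='~'))
print(split_tagged_token('1/2/cd'))   # only the last '/' separates word from tag
print(split_tagged_token('fly'))
```

Splitting at the rightmost separator matters for tokens that themselves contain the separator character, such as '1/2'.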

from nltk.corpus import brown
words_tag = brown.tagged_words(categories='news')
print(words_tag[:10])

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]

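Simplified tags: the simplify_tags parameter used in earlier versions was replaced by tagset in NLTK 3 (the release line that supports Python 3).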

words_tag = brown.tagged_words(categories='news', tagset='universal')
print(words_tag[:10])

[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP')]
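
One common use of the coarse universal tags is counting how often each category occurs. The sketch below tallies the ten pairs printed above with collections.Counter. The data is hard-coded so the snippet runs without the Brown corpus; on the real corpus the same counting would presumably be applied to the result of brown.tagged_words(...).

```python
from collections import Counter

# The first ten (word, tag) pairs of the news category, copied from the output above.
words_tag = [('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'),
             ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'),
             ('investigation', 'NOUN'), ('of', 'ADP')]

# Count the tags only, ignoring the words themselves.
tag_counts = Counter(tag for _, tag in words_tag)
print(tag_counts.most_common(2))
```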

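brown can be treated as an instance of CategorizedTaggedCorpusReader.

CategorizedTaggedCorpusReader.tagged_words(fileids, categories): accepts file identifiers or category identifiers as arguments and returns the list of part-of-speech-tagged words from those texts.

CategorizedTaggedCorpusReader.tagged_sents(fileids, categories): accepts file identifiers or category identifiers as arguments and returns the list of part-of-speech-tagged sentences from those texts, where each sentence is a list of (word, tag) tuples.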

tagged_sents = brown.tagged_sents(categories='news')
print(tagged_sents)

[[('The', 'AT'), ... ('.', '.')],
[('The', 'AT'), ... ('jury', 'NN'), ...],
...]
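
Conceptually, the two methods expose the same tagged tokens at different granularity: flattening the list of sentences gives a list of words. The sketch below illustrates that relationship on a tiny hand-written sample in the same shape as the tagged_sents output. The sample sentences are invented for illustration and are not taken from the Brown corpus.

```python
from itertools import chain

# Hand-written toy data shaped like tagged_sents() output: a list of sentences,
# each sentence a list of (word, tag) tuples. Not real corpus content.
toy_tagged_sents = [
    [('The', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('.', '.')],
    [('The', 'AT'), ('jury', 'NN'), ('met', 'VBD'), ('.', '.')],
]

# Flattening the sentences yields one list of tagged words, the tagged_words() shape.
toy_tagged_words = list(chain.from_iterable(toy_tagged_sents))

print(len(toy_tagged_sents))
print(toy_tagged_words[:3])
```

Keeping the sentence boundaries is useful whenever a tagger or model needs context within a sentence; the flat form is enough for simple counting.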

