Natural Language 2: Common Functions


Natural language, NLP, NLTK, Python, tokenization, normalization, linguistics, semantics

An NLP enthusiast's blog

2. Calling sents(fileid) fails with: Resource 'tokenizers/punkt/english.pickle' not found. Please use the NLTK ********** to obtain the resource:

import nltk

nltk.download()
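Calling nltk.download() with no arguments opens the interactive downloader. As a minimal sketch, the missing resource can also be fetched non-interactively. The helper name ensure_nltk_resource is hypothetical (it is not part of NLTK); it relies only on nltk.data.find and nltk.download, and assumes the package is called "punkt", which matches the path in the error above.

```python
def ensure_nltk_resource(resource_path, package_name):
    """Download package_name only if resource_path is not installed yet.

    Hypothetical helper, not part of NLTK.
    Returns True if a download was attempted, False if the resource was found.
    """
    # Imported lazily so that defining the helper does not require NLTK.
    import nltk

    try:
        nltk.data.find(resource_path)   # raises LookupError when missing
        return False
    except LookupError:
        nltk.download(package_name)     # fetches just this one package
        return True


# Typical call for the error above (needs NLTK and network access):
#   ensure_nltk_resource("tokenizers/punkt", "punkt")
```

Some newer NLTK releases may ask for a differently named package (for example punkt_tab); in that case, pass the name shown in the error message.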

3. Functions for accessing corpus elements

from nltk.corpus import webtext

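webtext.fileids() # IDs of all files in the corpus

webtext.raw(fileid) # full text of the given file, as a single string

webtext.words(fileid) # all words (tokens) in the given file

webtext.sents(fileid) # all sentences in the given file, each a list of words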

Example                      Description
fileids()                    the files of the corpus
fileids([categories])        the files of the corpus corresponding to these categories
categories()                 the categories of the corpus
categories([fileids])        the categories of the corpus corresponding to these files
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified fileids
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified fileids
sents(categories=[c1,c2])    the sentences of the specified categories
abspath(fileid)              the location of the given file on disk
encoding(fileid)             the encoding of the file (if known)
open(fileid)                 open a stream for reading the given corpus file
root()                       the path to the root of locally installed corpus
readme()                     the contents of the README file of the corpus
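The accessors listed above can be combined into a quick per-file summary. The following is only a sketch: summarize_corpus is a hypothetical helper, and it assumes nothing about the corpus object beyond the fileids(), raw(fileid) and words(fileid) methods described in this section. It leaves out sents() so that it does not depend on the punkt tokenizer.

```python
def summarize_corpus(corpus, max_files=3):
    """Return (fileid, character count, word count) for the first few files.

    Hypothetical helper: `corpus` is any object offering fileids(),
    raw(fileid) and words(fileid), such as an NLTK corpus reader.
    """
    rows = []
    for fileid in corpus.fileids()[:max_files]:
        n_chars = len(corpus.raw(fileid))    # raw() gives the text as a string
        n_words = len(corpus.words(fileid))  # words() gives a sequence of tokens
        rows.append((fileid, n_chars, n_words))
    return rows


# Typical use, assuming NLTK is installed and the webtext corpus is downloaded:
#   from nltk.corpus import webtext
#   for row in summarize_corpus(webtext):
#       print(row)
```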

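4. Some common text-processing functions

Suppose text is a list of words:

len(text) # number of words

set(text) # remove duplicates

sorted(text) # sort the words

text.count('a') # number of times the given word occurs

text.index('a') # position of the first occurrence of the given word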
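The built-in list operations above can be tried on a small sample. This is a minimal sketch; the sample sentence is made up purely for illustration.

```python
# A tiny, made-up word list standing in for a tokenized text.
text = ["the", "cat", "sat", "on", "the", "mat"]

n_words = len(text)            # total number of tokens: 6
vocab = sorted(set(text))      # distinct words in alphabetical order
the_count = text.count("the")  # "the" occurs twice
first_on = text.index("on")    # zero-based position of the first "on": 3

print(n_words, vocab, the_count, first_on)
```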

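The functions below come from NLTK (from nltk import FreqDist, bigrams). The text.* methods are defined on nltk.Text, not on a plain list, so wrap the word list first: text = nltk.Text(words).

FreqDist(text) # words and their frequencies; keys() gives the words, fd[key] gives the count

FreqDist(text).plot(50, cumulative=True) # cumulative frequency plot of the 50 most common words

list(bigrams(text)) # all pairs of adjacent words

text.collocations() # word pairs that frequently occur together in the text

text.concordance("word") # every occurrence of the given word, with its surrounding context

text.similar("word") # words that appear in contexts similar to those of the given word

text.common_contexts(["a", "b"]) # contexts shared by the two words

text.dispersion_plot(['a','b','c',...]) # plot comparing where each word occurs across the text

text.generate() # generate a random passage in the style of the text (may be missing in some NLTK 3.x releases)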
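To make clear what the frequency and bigram functions compute, here is a rough standard-library sketch that runs without NLTK or its data. It is not NLTK's implementation: collections.Counter stands in for FreqDist (which, in NLTK 3, is a Counter subclass), and zip stands in for bigrams. The sample sentence is made up.

```python
from collections import Counter

# Made-up sample text, already split into words.
text = "the cat sat on the mat and the cat slept".split()

# Word frequencies: word -> count, like FreqDist(text).
freq = Counter(text)

# Adjacent word pairs, like list(bigrams(text)).
pairs = list(zip(text, text[1:]))

# The most frequent adjacent pair, a crude stand-in for collocations().
pair_freq = Counter(pairs)

print(freq.most_common(2))       # [('the', 3), ('cat', 2)]
print(pair_freq.most_common(1))  # [(('the', 'cat'), 2)]
```

Note that collocations() in NLTK scores pairs with an association measure rather than raw counts, so its results can differ from this simple count.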
