NLP: Chinese Text Preprocessing

jieba is a library built for Chinese word segmentation, but it does considerably more than segmentation alone.

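Unlike English, where spaces separate meaningful words, Chinese text needs a tool to split a continuous string into individual words. Where English tokenization commonly uses nltk, Chinese uses jieba.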
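To make the problem concrete, here is a minimal sketch of forward maximum matching, a dictionary-based technique that is much simpler than what jieba actually does. The function name `forward_max_match` and the toy `vocab` set are assumptions made up for this illustration.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word, falling back to a single character if nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy vocabulary (an assumption for this example, not jieba's dictionary).
vocab = {"我", "家住", "在", "黃土", "高坡"}
print("/".join(forward_max_match("我家住在黃土高坡", vocab)))
```

Without a vocabulary there is nothing in the raw string that marks word boundaries, which is why a dedicated segmenter is needed.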

pip install jieba

4. Part-of-speech tagging

5. Tokenize

6. Stop-word removal

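import jieba

# Precise mode (cut_all=False)
word = jieba.cut('他去了杭研大廈', cut_all=False)
print('/'.join(word))
# Result: 他/去/了/杭研/大廈

# Search-engine mode
word = jieba.cut_for_search('小明碩士畢業於中物院計算所')
print('，'.join(word))
# Result: 小明，碩士，畢業，於，中物，中物院，計算，計算所

# lcut_for_search returns a list instead of a generator
word = jieba.lcut_for_search("小明碩士畢業於中物院計算所")
print(' '.join(word))
# Result: 小明 碩士 畢業 於 中物 中物院 計算 計算所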

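When segmenting text from a specific domain, specialized terms will appear that the default dictionary does not know. In that case, segmentation needs a custom dictionary.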

print('/'.join(word))

jieba.add_word('深度學習', freq=None, tag=None)

print('/'.join(word))
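Besides adding words one at a time, jieba can also load a whole user dictionary file with `jieba.load_userdict`. Each line of that file holds a word, followed by an optional frequency and an optional part-of-speech tag, separated by spaces, and the file should be UTF-8 encoded. The sketch below only writes such a file using the standard library. The helper name `write_userdict`, the example terms, and their frequencies and tags are all assumptions for illustration.

```python
import os
import tempfile


def write_userdict(path, entries):
    """Write (word, freq, tag) entries in jieba's user-dictionary format:
    one word per line, then an optional frequency and an optional POS tag,
    separated by spaces. freq and tag may be None."""
    with open(path, "w", encoding="utf-8") as f:
        for word, freq, tag in entries:
            fields = [word]
            if freq is not None:
                fields.append(str(freq))
            if tag is not None:
                fields.append(tag)
            f.write(" ".join(fields) + "\n")


# Hypothetical domain terms; frequencies and tags are made up for illustration.
entries = [("深度學習", 10, "n"), ("卷積神經網路", None, "n"), ("注意力機制", None, None)]
userdict_path = os.path.join(tempfile.gettempdir(), "userdict.txt")
write_userdict(userdict_path, entries)

with open(userdict_path, encoding="utf-8") as f:
    print(f.read())
```

The resulting file can then be passed to `jieba.load_userdict(userdict_path)` before segmenting.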

print('/'.join(jieba.cut('如果放在舊字典中將出錯', HMM=False)))
# Output: 如果/放在/舊/字典/中將/出錯

jieba.suggest_freq(('中', '將'), True)
print('/'.join(jieba.cut('如果放在舊字典中將出錯', HMM=False)))
# Output: 如果/放在/舊/字典/中/將/出錯

jieba.suggest_freq('舊字典', True)
print('/'.join(jieba.cut('如果放在舊字典中將出錯', HMM=False)))
# Output: 如果/放在/舊字典/中/將/出錯

import jieba.analyse

jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())

import jieba.analyse as analyse
# 'file_name' is the name of the file to extract keywords from
with open('file_name', encoding='utf-8') as f:
    lines = f.read()
print(' '.join(analyse.extract_tags(lines, topK=20, withWeight=False, allowPOS=())))

A few points to note

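jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))

With this allowPOS setting, only place names (ns), nouns (n), verbal nouns (vn), and verbs (v) are extracted.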
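The effect of a part-of-speech whitelist can be sketched on plain (word, flag) pairs. This is a simplified illustration, not jieba's internal filter. The function name `filter_by_pos` is hypothetical, the pairs are written by hand, and the sketch treats an empty tuple as "no filtering".

```python
def filter_by_pos(pairs, allow_pos):
    """Keep only the words whose POS flag is in allow_pos.
    In this sketch an empty allow_pos means no filtering."""
    if not allow_pos:
        return [word for word, _ in pairs]
    allowed = frozenset(allow_pos)
    return [word for word, flag in pairs if flag in allowed]


# (word, flag) pairs written by hand for illustration.
pairs = [("我", "r"), ("家住", "v"), ("在", "p"), ("黃土", "n"), ("高坡", "nr")]
print(filter_by_pos(pairs, ("ns", "n", "vn", "v")))
```

Only the verb and the noun pass the whitelist here. The pronoun, the preposition, and the word tagged `nr` are dropped.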


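The idea behind the algorithm:

Build a graph from the co-occurrence relations between words, using a fixed window size (5 by default, adjustable through the span attribute).

Compute the PageRank of each node in the graph. Note that the graph is undirected and weighted.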
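The two steps above can be sketched in pure Python. This is a simplified illustration under stated assumptions, not jieba's implementation. The function names, the damping factor of 0.85, the fixed 20 iterations, and the hand-segmented word list are all choices made for this example.

```python
from collections import defaultdict


def build_cooccurrence_graph(words, span=5):
    """Undirected weighted graph: every pair of distinct words that co-occur
    within a window of `span` consecutive words gets its edge weight raised by 1."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + span, len(words))):
            other = words[j]
            if other == word:
                continue  # skip self-loops in this sketch
            graph[word][other] += 1.0
            graph[other][word] += 1.0
    return graph


def pagerank(graph, damping=0.85, iterations=20):
    """Weighted PageRank by fixed-count power iteration."""
    nodes = sorted(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    total_weight = {node: sum(graph[node].values()) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour passes on its rank in proportion to the edge weight.
            incoming = sum(
                graph[node][nbr] / total_weight[nbr] * rank[nbr]
                for nbr in graph[node]
            )
            new_rank[node] = (1 - damping) + damping * incoming
        rank = new_rank
    return rank


# Hand-segmented words, for illustration only.
words = ["關鍵詞", "提取", "使用", "圖", "演算法", "計算", "圖", "節點", "權重"]
ranks = pagerank(build_cooccurrence_graph(words, span=5))
for word in sorted(ranks, key=ranks.get, reverse=True)[:3]:
    print(word, round(ranks[word], 3))
```

Because every edge is stored in both directions with the same weight, the graph is undirected, and words that co-occur with many others accumulate the highest scores.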

import jieba.posseg as pseg
words = pseg.cut('我家住在黃土高坡')
for word, flag in words:
    print('%s %s' % (word, flag))
# Output:
# 我 r
# 家住 v
# 在 p
# 黃土 n
# 高坡 nr

jieba.tokenize() returns the start and end position of each word in the original text.

seg = jieba.tokenize('我家住在黃土高坡')
for s in seg:
    print('%s\t\t start:%d\t\t end:%d' % (s[0], s[1], s[2]))
# Output:
# 我        start:0         end:1
# 家住       start:1         end:3
# 在        start:3         end:4
# 黃土       start:4         end:6
# 高坡       start:6         end:8
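The positions in that output follow directly from the word lengths: each word starts where the previous one ended, and the end index is exclusive. A minimal sketch that recomputes them from an already segmented list is shown here. The function name `token_offsets` is hypothetical, and the sketch assumes the tokens concatenate back to the original text.

```python
def token_offsets(tokens):
    """Yield (word, start, end) for consecutive tokens, with end exclusive,
    assuming the tokens concatenate back to the original text."""
    start = 0
    for word in tokens:
        end = start + len(word)
        yield word, start, end
        start = end


tokens = ["我", "家住", "在", "黃土", "高坡"]
for word, start, end in token_offsets(tokens):
    print("%s\tstart:%d\tend:%d" % (word, start, end))
```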

def stopwordslist(filepath):
    # Read one stop word per line from a UTF-8 text file
    with open(filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords

stopwords=stopwordslist('../input/stop_word.txt')
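Once the stop-word list is loaded, it is typically applied to the segmented tokens. A minimal sketch follows. The function name `remove_stopwords` is hypothetical, and the token and stop-word lists are written by hand for illustration rather than read from a file.

```python
def remove_stopwords(tokens, stopwords):
    """Drop tokens that are stop words or consist only of whitespace."""
    stopword_set = set(stopwords)  # set membership is fast for large lists
    return [t for t in tokens if t.strip() and t not in stopword_set]


# Hand-written tokens and stop words, for illustration only.
sample_tokens = ["我", "家住", "在", "黃土", "高坡"]
sample_stopwords = ["我", "在", "的"]
print("/".join(remove_stopwords(sample_tokens, sample_stopwords)))
```

In practice the tokens would come from a segmenter such as `jieba.lcut`, and the stop words from a file loaded as shown above.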
