NLP Chinese Short-Text Processing Notes (1)

NLP short-text processing

Text cleaning for NLP

Some common NLP terms

Articles to read

Day 2 of study

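Before writing code, decide what format the data should be output in. It is best to sketch it on paper first, and only then work out how to implement it.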

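When reading a CSV file that contains Chinese text with read_csv(), set encoding='utf-8' or encoding='gb18030'. If some rows still cannot be parsed, add the argument error_bad_lines=False (deprecated since pandas 1.3 and removed in 2.0; on newer versions use on_bad_lines='skip').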
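Choosing between the two encodings can be automated. The following is a minimal sketch, not part of the original notes: the helper name detect_encoding and its candidate list are assumptions, and it simply returns the first encoding that decodes the whole file without error.

```python
def detect_encoding(path, candidates=("utf-8", "gb18030")):
    """Return the first candidate encoding that decodes the whole file, else None.

    Hypothetical helper: the name and the candidate order are assumptions.
    utf-8 is tried first because it is the stricter of the two.
    """
    with open(path, "rb") as f:
        raw = f.read()  # reads the whole file; fine for small and medium CSVs
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
        return enc
    return None
```

The result can then be passed on, for example pd.read_csv(path, encoding=detect_encoding(path), on_bad_lines='skip') on pandas 1.3 or later.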

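Handling an irregular JSON file

import json

mess=[]
with open('謠言.json','r',encoding='utf-8') as f:
    lines=f.readlines()
i=0
for line in lines:
    print(line)
    # each line is parsed as its own JSON document
    data=json.loads(line)
    # the original listing never fills mess; appending the parsed record is an assumption
    mess.append(data)
    # uncomment to stop after the first 200 lines while debugging
    #i+=1
    #if i>200:
    #    break
print(mess)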
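A single malformed line makes json.loads raise and stops the whole loop. Below is a hedged sketch of a more tolerant version, assuming (as the loop above suggests) that the file holds one JSON document per line. The function name parse_json_lines and the sample data are made up for illustration.

```python
import json


def parse_json_lines(lines):
    """Parse one JSON document per line.

    Returns (records, bad_line_numbers); line numbers start at 1.
    Hypothetical helper, not from the original notes.
    """
    records, bad = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad.append(number)  # remember the line instead of crashing
    return records, bad


# Made-up sample: line 2 is blank, line 3 is truncated JSON.
sample = ['{"id": 1, "text": "a"}\n', '\n', '{"id": 2, "text": \n', '{"id": 3}\n']
records, bad = parse_json_lines(sample)
```

Collecting the bad line numbers makes it possible to inspect those lines by hand afterwards rather than losing them silently.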

Mofan Python (莫凡python) tutorials

A series of NLTK articles, not read yet

Recommended: How to Solve 90% of NLP Problems: A Step-by-Step Guide

The translate function in Python 3

>>> table=str.maketrans('ab','yz')
>>> 'abcdefg...xyz'.translate(table)
'yzcdefg...xyz'

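string.punctuation is not much help here either: it only contains English (ASCII) punctuation characters

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'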

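Combining the two tips above to strip punctuation (stop words should not be removed this way)

import string

text='你好，world!'  # sample input
# remove Chinese punctuation
intab='，。？：「」『』；-@'
table=str.maketrans(intab,' '*len(intab))
# could also be written table=str.maketrans('','',intab);
# the form above leaves a space in place of each removed character
text=text.translate(table)
# remove English punctuation
table=str.maketrans('','',string.punctuation)
text=text.translate(table)
# Removing stop words this way is not recommended:
# it deletes every individual character that appears in intab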
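The warning about stop words can be made concrete. The sketch below is an illustration under stated assumptions: CN_PUNCT is an assumed, incomplete list of Chinese punctuation, and both function names are hypothetical. It removes punctuation at the character level but stop words at the token level, then shows what goes wrong when a stop word is deleted character by character.

```python
import string

# Assumed list of Chinese punctuation; extend it for real data.
CN_PUNCT = "，。？：「」『』；！、"
PUNCT_TABLE = str.maketrans("", "", CN_PUNCT + string.punctuation)


def strip_punctuation(text):
    """Delete Chinese and ASCII punctuation character by character."""
    return text.translate(PUNCT_TABLE)


def remove_stop_words(tokens, stop_words):
    """Drop whole tokens that are stop words; other tokens stay intact."""
    return [t for t in tokens if t not in stop_words]


tokens = ["我", "的", "目的地"]
# Token-level removal keeps the word 目的地 whole.
by_token = remove_stop_words(tokens, {"的"})
# Character-level removal with translate also strips 的 from inside 目的地.
by_char = "".join(tokens).translate(str.maketrans("", "", "的"))
```

Here by_token keeps 目的地 as a word, while by_char ends up as 我目地, which is why translate suits punctuation but not stop words.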

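References:

How to Clean Text for Machine Learning with Python (a translation is also available for reference)

Translation: cleaning English text with NLTK

Key terms in natural language processing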

natural language processing

tokenization

normalization

stemming

lemmatization

corpus

stop words

part-of-speech (POS) tagging

statistical language modeling

bag of words

n-grams

regular expressions

Zipf's law

similarity measures

syntactic analysis

semantic analysis

sentiment analysis
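Two of the terms above, bag of words and n-grams, are easy to illustrate once a text has been tokenized. This is a minimal sketch using only the standard library; the function names and the sample tokens are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


tokens = ["to", "be", "or", "not", "to", "be"]
bigrams = ngrams(tokens, 2)  # pairs of adjacent tokens
bow = bag_of_words(tokens)   # token -> frequency
```

A bag of words throws away word order, while n-grams keep a little local order, which is the main trade-off between the two.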

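CS224n notes 2: vector representations of words, word2vec

Understanding the essence of word2vec word vectors in seconds

Machine learning and the scikit-learn library

Wrapping Chinese word segmentation in a function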

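import jieba
import jieba.posseg as pseg
from jieba.analyse import extract_tags
import re
import pandas as pd

def text_cut(filename,allowpos=['n','ns','nr','nr2']):
    """
    :param filename: name of the file to read (joined onto path below)
    :param allowpos: part-of-speech tags to keep
    :return: a DataFrame
    """
    path='d:\\pycharm 2017.3\\pyprojects\\rumors\\venv\\include\\data\\'
    jieba.load_userdict(path+'userdict.txt')
    with open(path+filename,'r',encoding='gb18030') as f:
        context=f.read()
    # split the text into sentences at 。 and ！ and remove newlines
    sentence=[i.replace('\n','').strip() for i in re.split('。|！',context)]
    # load the stop words into a set (assumes whitespace-separated words in the file)
    # so that membership is tested per word rather than by substring
    with open(path+'stop_words.txt','r',encoding='utf-8') as f:
        stop_words=set(f.read().split())
    data=[]
    for s in sentence:
        # segment the sentence, dropping single-character tokens and stop words
        con=[item for item in jieba.lcut(s) if len(item)>1 and item not in stop_words]
        # keep only the words with the requested part-of-speech tags
        seg=pseg.cut(s)
        seg_list=['%s'%item for item in seg if item.flag in allowpos and len(item.word)>1]
        # the append step is missing from the original listing;
        # one row per sentence is an assumption
        data.append([con,seg_list])
    df_text=pd.DataFrame(data,columns=['sentence','posseg'])
    return df_text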
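The splitting and filtering steps of the function above can be tried without jieba or the data files. The sketch below is a stand-alone illustration, not the author's code: all three function names are hypothetical, and a whitespace tokenizer stands in for jieba.lcut so that the example runs with the standard library only.

```python
import re


def split_sentences(text):
    """Split on 。 and ！, remove newlines, and drop empty pieces."""
    parts = re.split("[。！]", text)
    return [p.replace("\n", "").strip() for p in parts if p.strip()]


def filter_tokens(tokens, stop_words):
    """Keep tokens longer than one character that are not stop words."""
    return [t for t in tokens if len(t) > 1 and t not in stop_words]


def cut_rows(text, tokenize, stop_words):
    """Return one (sentence, kept_tokens) pair per sentence.

    tokenize is any callable mapping a sentence to a list of tokens;
    in practice this would be jieba.lcut.
    """
    return [(s, filter_tokens(tokenize(s), stop_words))
            for s in split_sentences(text)]


# Toy input that is already space-separated, so str.split can act as the tokenizer.
rows = cut_rows("今天 天氣 很 好。\n我們 去 公園！", str.split, {"我們"})
```

Passing the tokenizer in as an argument keeps the cleaning logic testable on its own, and rows has the shape that pd.DataFrame(rows, columns=[...]) expects.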

To be continued...
