A Worked Example of Structuring Chinese Text Data

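import jieba
import gensim
from gensim import corpora, models
from gensim.matutils import corpus2dense

# The input is the Chinese essay 《背影》; put 背影.txt in the current directory.
f = open("背影.txt", "r")
text1 = f.readlines()        # read the file line by line into a list
read = text1
#text1 = f.read()            # alternative: read everything as one string
#text1.splitlines()          # then split it on \n
f.close()

# The original post does not show the line that opens the stop-word file.
# "stopwords.txt" is a placeholder name; substitute your own stop-word list.
f = open("stopwords.txt", "r")
text2 = f.read()
f.close()
stop_word = text2.splitlines()

text = []
for i in range(len(read)):                       # process the essay line by line
    seg_useful = []
    segs = jieba.cut(read[i])                    # jieba segments strings only, not lists
    for seg in segs:
        if seg not in stop_word:                 # drop stop words
            seg_useful.append(seg)
    text.append(seg_useful)

dictionary = corpora.Dictionary(text)            # build the dictionary
word_count = [dictionary.doc2bow(text[i]) for i in range(len(text))]    # bag-of-words corpus
dtm_matrix = corpus2dense(word_count, len(dictionary))   # shape: terms x documents
dtm_matrix.T                                     # transpose to get the document-term matrix

print(len(word_count))
tfidf_model = models.TfidfModel(word_count)      # build the TF-IDF model
tfidf = tfidf_model[word_count]
print(tfidf)
tfidf_matrix = corpus2dense(tfidf, len(dictionary))
tfidf_matrix

# Train word vectors. gensim 4.0 and later renamed `size` to `vector_size`.
model = gensim.models.Word2Vec(text, size=100, window=5, min_count=2)
model.wv[u'月台']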
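
To make the shape of the matrices above concrete, here is a minimal pure-Python sketch of what the dictionary, doc2bow and corpus2dense steps compute. It does not call gensim. The helper names (build_vocab, to_bow, to_dense, transpose) and the two toy documents are made up for illustration, and the order in which ids are assigned is an assumption that may differ from gensim's.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

def to_dense(bows, num_terms):
    """Build a terms x documents count matrix as nested lists."""
    matrix = [[0] * len(bows) for _ in range(num_terms)]
    for col, bow in enumerate(bows):
        for term_id, count in bow:
            matrix[term_id][col] = count
    return matrix

def transpose(matrix):
    """Swap rows and columns (terms x documents -> documents x terms)."""
    return [list(row) for row in zip(*matrix)]

# Two toy documents, already segmented (hypothetical data).
docs = [["父親", "月台", "橘子"], ["父親", "背影", "父親"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
term_doc = to_dense(bows, len(vocab))   # one row per term
doc_term = transpose(term_doc)          # one row per document
```

Each row of doc_term corresponds to one line of the essay, which is why the listing above transposes the result of corpus2dense before treating it as a document-term matrix.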

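Library functions may change between releases, so some of the calls above may stop working. Check the date before relying on this code: it was written on 2018/3/29.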
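
Because library APIs drift, it can help to know what the TF-IDF step computes independently of any one release. The sketch below is a library-free approximation, assuming the weighting is raw term frequency times log2(N / document frequency), followed by L2 normalisation of each document, with zero-weight terms dropped. That is my understanding of gensim's default, not something the original post states, so verify it against your installed version. The function name tfidf and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {token: weight} dict per tokenized document.

    Assumed weighting: tf * log2(N / df), then L2-normalised per document.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))              # count each token once per document

    weighted = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: c * math.log2(n_docs / df[t]) for t, c in tf.items()}
        vec = {t: w for t, w in vec.items() if w > 0}   # drop zero-weight terms
        norm = math.sqrt(sum(w * w for w in vec.values()))
        weighted.append({t: w / norm for t, w in vec.items()} if norm else {})
    return weighted

# Two toy documents, already segmented (hypothetical data).
docs = [["父親", "月台", "橘子"], ["父親", "背影", "父親"]]
weights = tfidf(docs)
```

A token that appears in every document, such as 父親 here, gets an inverse document frequency of zero and disappears from the result, which is the behaviour that makes TF-IDF useful for down-weighting ubiquitous words.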
