Read every file under the corpus folder and concatenate the text:

import os

os.chdir('d:\\')

text = ''
text1 = ''
for root, dirs, files in os.walk(r'd:\綠色金融文字庫'):
    for i in files:
        path = os.path.join(root, i)
        with open(path, 'r', encoding='gb18030', errors='ignore') as f:
            text = f.readline()  # reads only the first line of each file; f.read() would take the whole file
            text1 = text1 + text
# Strip spaces, newlines, the byline '新華社記者', and filler words such as 中國/近日/年/月/日/中
text1 = text1.replace(' ', '')
text1 = text1.replace('新華社記者', '')
text1 = text1.replace('中國', '')
text1 = text1.replace('月', '')
text1 = text1.replace('近日', '')
text1 = text1.replace('日', '')
text1 = text1.replace('年', '')
text1 = text1.replace('中', '')
text1 = text1.replace('\n', '')
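The chain of replace calls can also be collapsed into one regular-expression substitution. The sketch below is an assumption, not part of the original steps; it simply joins the same strings with |, keeping longer strings such as 中國 and 近日 before their substrings so the result matches the chained replaces:

import re

# Assumed one-pass equivalent of the replace chain above
pattern = '|'.join(['新華社記者', '近日', '中國', '中', '年', '月', '日', ' ', '\n'])
text1 = re.sub(pattern, '', text1)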
Build the stopword table from a stopword file:

# {}.fromkeys(...) makes a dict whose keys are the stopwords (the values are None)
stopwords = {}.fromkeys(
    [line.rstrip() for line in open(r'd:\stopword.txt', encoding='utf-8', errors='ignore')]
)
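Since stopwords is only ever used for membership tests, a plain set built from the same file would work just as well; a minimal sketch under that assumption (stopword_set is a hypothetical name):

# Assumed alternative: a set gives the same `in` test without the dummy None values
with open(r'd:\stopword.txt', encoding='utf-8', errors='ignore') as f:
    stopword_set = {line.rstrip() for line in f}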
Tokenize the text and remove the stopwords:
import jieba

strings = jieba.cut(text1)
str = ''  # note: this shadows the built-in str, but keeps the original variable name
for i in strings:
    if i not in stopwords:
        str += i
Obtain the tokenized text with the stopwords removed:

str1 = jieba.cut(str)
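jieba.cut returns a generator, so the filtering and re-cutting above could also be merged into a single pass. Note that re-cutting the joined string can segment adjacent words differently; if that is not intended, the filtered tokens can be kept directly (tokens is a hypothetical name):

# Assumed single-pass alternative: keep the filtered tokens instead of joining and re-cutting
tokens = [w for w in jieba.cut(text1) if w not in stopwords]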
Next come the word frequency count and the word cloud; the frequency count uses the collections package:
import collections

word_counts = collections.Counter(str1)          # count the frequency of each token
word_counts_top50 = word_counts.most_common(50)  # take the 50 most frequent words
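most_common returns a list of (word, count) pairs; a minimal sketch for inspecting the result:

# Print the 50 most frequent words and their counts
for word, count in word_counts_top50:
    print(word, count)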
Finally, generate the word cloud; this needs the wordcloud package:
from wordcloud import WordCloud

wc = WordCloud(background_color="black",
               max_words=300,
               font_path='c:/windows/fonts/simkai.ttf',  # a Chinese font so CJK characters render
               min_font_size=15,
               max_font_size=50,
               width=600,
               height=600)
wc.generate_from_frequencies(word_counts)
wc.to_file("wordcloud.png")
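To preview the result without opening the saved file, the cloud can also be shown with matplotlib (an assumption; not part of the original steps):

import matplotlib.pyplot as plt

# Render the generated cloud inline; a WordCloud object can be passed to imshow directly
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()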