1. Word frequency statistics:
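import jieba

# Read the whole novel as a single string.
txt = open("threekingdoms3.txt", "r", encoding="utf-8").read()
# Segment the text into a list of words.
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single-character tokens.
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
# Sort the (word, count) pairs by count, highest first.
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    # The original format string was lost; this one matches the output below.
    print("{} {}".format(word, count))

The result is: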
曹操 946
孔明 737
將軍 622
玄德 585
卻說 534
關公 509
荊州 413
二人 410
丞相 405
玄德曰 390
不可 387
孔明曰 374
張飛 358
如此 320
不能 318
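The counting loop above can also be written more compactly with `collections.Counter` from the standard library. This is only a sketch: the helper name `top_words` and its parameters are my own, and the toy `sample` list stands in for the output of `jieba.lcut(txt)` so the snippet runs without the novel or jieba.

```python
from collections import Counter


def top_words(words, n=15, min_len=2):
    """Return the n most frequent tokens that have at least min_len characters."""
    counts = Counter(w for w in words if len(w) >= min_len)
    return counts.most_common(n)


# Toy, pre-segmented input standing in for jieba.lcut(txt).
sample = ["曹操", "曰", "孔明", "曹操", ",", "將軍", "孔明", "曹操"]
for word, count in top_words(sample, n=3):
    print("{} {}".format(word, count))
```

`most_common` does the sorting and the slicing in one call, so the separate `items.sort(...)` step is no longer needed.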
As a further improvement, I only want the appearance counts for the characters, as follows:
import jieba

txt = open("threekingdoms3.txt", "r", encoding="utf-8").read()
# The original list of names was lost; these ten are taken from the output below.
names = ["曹操", "孔明", "劉備", "關羽", "張飛", "呂布", "趙雲", "孫權", "周瑜", "袁紹"]
words = jieba.lcut(txt)
counts = {}
# Merge the different forms of address for the same character while counting.
for word in words:
    if len(word) == 1:
        continue
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
# for word in excludes:
#     del counts[word]
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Of the 40 most frequent words, print only the character names.
for i in range(40):
    word, count = items[i]
    if word in names:
        print("{} {}".format(word, count))

The output is:
曹操 1358
孔明 1265
劉備 1251
關羽 783
張飛 358
呂布 300
趙雲 278
孫權 257
周瑜 217
袁紹 191
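The chain of `elif` branches grows quickly as more aliases are added. One possible alternative, sketched here, keeps the aliases in a dictionary and looks each word up. The names `ALIASES` and `count_characters` are my own; the alias pairs are the same ones used in the code above, and the toy `sample` list again stands in for `jieba.lcut(txt)`.

```python
# Alias -> canonical name, using the same pairs as the elif chain.
ALIASES = {
    "諸葛亮": "孔明", "孔明曰": "孔明",
    "關公": "關羽", "雲長": "關羽",
    "玄德": "劉備", "玄德曰": "劉備",
    "孟德": "曹操", "丞相": "曹操",
}


def count_characters(words, aliases=ALIASES):
    """Count words, folding every alias into its canonical name."""
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        # Fall back to the word itself when it has no alias entry.
        rword = aliases.get(word, word)
        counts[rword] = counts.get(rword, 0) + 1
    return counts


# Toy, pre-segmented input standing in for jieba.lcut(txt).
sample = ["玄德", "曰", "玄德曰", "劉備", "丞相", "曹操", "雲長"]
counts = count_characters(sample)
print(counts)
```

Adding another alias then means adding one dictionary entry instead of another branch.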
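Going one step further, make a word cloud:
The names can be optimized further; see Part 2.
By default, the wordcloud library renders Chinese text as garbled characters because its default font has no Chinese glyphs; the usual fix is to pass a Chinese-capable font through its font_path parameter.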
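A rough sketch of how the character counts could be fed to the wordcloud library with a Chinese font. The helper names `name_frequencies` and `render_cloud` are my own, the font path is a placeholder you would replace with a real Chinese-capable font file on your machine, and the counts are copied from the output earlier in this post. The rendering call is left commented out because it needs the third-party `wordcloud` package and a real font file.

```python
def name_frequencies(counts, names):
    """Keep only the entries of counts whose key is a character name."""
    return {name: counts[name] for name in names if name in counts}


def render_cloud(frequencies, font_path, out_path="wordcloud.png"):
    """Draw a word cloud from a {word: count} dict and save it as an image."""
    # Imported here so name_frequencies works without wordcloud installed.
    from wordcloud import WordCloud

    cloud = WordCloud(font_path=font_path, width=800, height=600,
                      background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)


# Counts copied from the output earlier in this post.
counts = {"曹操": 1358, "孔明": 1265, "劉備": 1251, "關羽": 783, "張飛": 358}
freqs = name_frequencies(counts, ["曹操", "劉備", "呂布"])
print(freqs)

# Hypothetical font path; point it at a font that contains Chinese glyphs.
# render_cloud(freqs, font_path="path/to/chinese-font.ttf")
```

Using `generate_from_frequencies` skips wordcloud's own tokenizer, so the counts produced by jieba are used as they are.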