Using the jieba library in Python

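In English, we can split a string with .split() and get a list of words.

For example, the following script counts how often each word appears in Hamlet and prints the 10 most frequent words.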

#calhamletv1.py
def gettext():
    # Read the text and lowercase it so "The" and "the" count as one word
    txt = open("word frequency/hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")   # replace special characters with spaces
    return txt

hamlettxt = gettext()
words = hamlettxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)   # most frequent first
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Run it and look at the result: "the" turns out to be the most frequent word.
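As an aside, the counting loop in the script can also be written with collections.Counter from the standard library. This is a minimal sketch, not the original script: the sample sentence and the helper name top_words are made up for illustration, and it uses string.punctuation instead of the hand-written character list (so apostrophes are stripped as well).

```python
import string
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent lowercase words in text (hypothetical helper)."""
    text = text.lower()
    # Map every ASCII punctuation character to a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.translate(table).split()
    return Counter(words).most_common(n)

# Made-up sample text standing in for hamlet.txt
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

Counter.most_common does the sorting and slicing in one call, which removes the manual items.sort step.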

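In Chinese, however, the characters run together with no spaces between words, so this approach does not work. We can use jieba, a third-party Python library, to solve the problem.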
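A quick check makes the problem concrete. The two sample sentences below are made up for illustration.

```python
english = "to be or not to be"
chinese = "中国是一个伟大的国家"

print(english.split())  # six separate words
print(chinese.split())  # the whole sentence comes back as a single item
```

Because the Chinese sentence contains no whitespace, split() has nothing to cut on and returns the sentence unchanged inside a one-element list.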

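jieba is an excellent third-party library for Chinese word segmentation: it splits Chinese text into individual words. Because it is a third-party library, it has to be installed separately. Command: pip install jieba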

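jieba's segmentation relies on a large Chinese dictionary. It estimates the association probability between characters, and characters with a high probability are grouped into words, which gives the segmentation result. Besides segmenting text, it also lets users add their own custom words.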
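To make the idea of choosing the most probable grouping concrete, here is a toy sketch. It is not jieba's actual implementation: the tiny dictionary TOY_FREQ, its counts, and the function segment are all invented for illustration. It considers every way of cutting the sentence into dictionary words and keeps the one with the highest product of word probabilities, computed with dynamic programming over log-probabilities.

```python
import math

# Hypothetical word counts; a real dictionary has far more entries
TOY_FREQ = {"中": 5, "国": 5, "中国": 50, "是": 80, "一": 10, "个": 10,
            "一个": 60, "伟": 1, "大": 20, "伟大": 30, "的": 100,
            "国家": 40, "家": 8}
TOTAL = sum(TOY_FREQ.values())

def segment(sentence, max_len=4):
    """Split sentence into the words with the highest total log-probability."""
    n = len(sentence)
    # best[i] = (score, start) for the best segmentation of sentence[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            # Unknown single characters get a count of 1 so a path always exists
            freq = TOY_FREQ.get(word, 1 if len(word) == 1 else 0)
            if freq == 0:
                continue
            score = best[start][0] + math.log(freq / TOTAL)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end to recover the words
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(sentence[start:end])
        end = start
    return words[::-1]

print(segment("中国是一个伟大的国家"))
```

With these made-up counts, two-character words such as 中国 and 国家 beat their single-character splits because one moderately likely word scores higher than two unlikely ones.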

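jieba has three segmentation modes:

Precise mode: cuts the text precisely into words, with no redundant words

Full mode: scans out every possible word in the text, so the result contains redundancy

Search engine mode: starts from precise mode and cuts long words again into shorter ones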

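Common jieba functions and what they do:

jieba.lcut(s): precise mode; returns the segmentation result as a list

jieba.lcut(s, cut_all=True): full mode; returns the segmentation result as a list, with redundancy

jieba.lcut_for_search(s): search engine mode; returns the segmentation result as a list, with redundancy

jieba.add_word(w): adds the new word w to the segmentation dictionary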

Using jieba to count how often names appear in Romance of the Three Kingdoms

#calthreekingdomsv2.py
import jieba

# Words to drop from the final ranking. The list is missing from the source;
# fill in frequent words that are not names as needed.
excludes = set()
txt = open("word frequency/threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue   # skip single characters and punctuation
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)   # no error if the word never appeared
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)   # most frequent first
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Run the script to see the result.
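As the list of aliases grows, the if/elif chain in the script gets long. One alternative, offered as a suggestion rather than as part of the original script, is to keep the aliases in a dictionary. The names ALIASES and count_names are invented for this example, the excluded word is only a placeholder, and the hand-written token list stands in for the output of jieba.lcut so the sketch runs without the novel's text file.

```python
# Map each alias to the canonical name (same pairs as the script above)
ALIASES = {
    "諸葛亮": "孔明", "孔明曰": "孔明",
    "關公": "關羽", "雲長": "關羽",
    "玄德": "劉備", "玄德曰": "劉備",
    "孟德": "曹操", "丞相": "曹操",
}

def count_names(words, excludes=()):
    """Count multi-character tokens, merging aliases and skipping excluded words."""
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        name = ALIASES.get(word, word)
        if name in excludes:
            continue
        counts[name] = counts.get(name, 0) + 1
    return counts

# Hand-written tokens standing in for jieba.lcut(txt)
tokens = ["玄德", "曰", "孔明曰", "諸葛亮", "卻說", "玄德曰", "曹操", "丞相"]
print(count_names(tokens, excludes={"卻說"}))
```

Adding a new alias then means adding one dictionary entry instead of another elif branch.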
