Word frequency counting with Python

Suppose you have a local txt file and want to count how often each word occurs in it. You could write:

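import time

# Path to the local text file
path = 'c:\\users\\zhangxiaomei\\desktop\\walden.txt'

with open(path, 'r') as text:
    # Split the whole file on whitespace to get a list of words
    words = text.read().split()
    print(words)
    for word in words:
        time.sleep(3)  # pause between lines so the output can be followed
        print('{}-{} times'.format(word, words.count(word)))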

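The basic idea is to use the split function to separate the words of the text, then use the count function to count them. This approach has the following drawbacks:

1. Punctuation is counted too: a word with punctuation attached (for example "pond,") is treated as a different word from "pond".

2. Python is case-sensitive, so the same word is counted separately when its capitalization differs.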
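Both drawbacks are easy to see on a short string. The following is a minimal sketch; the sample sentence and the name `naive_counts` are made up for illustration and are not taken from walden.txt.

```python
def naive_counts(text):
    """Count tokens the naive way: split on whitespace, then list.count."""
    words = text.split()
    # Build a token -> count mapping; no cleanup is applied on purpose.
    return {word: words.count(word) for word in words}


sample = "The pond is deep. the Pond, the pond"
counts = naive_counts(sample)
for word, n in counts.items():
    print('{}-{} times'.format(word, n))
```

Here "pond" and "Pond," end up as two separate keys, and so do "The" and "the", even though a reader would treat each pair as one word.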

It can be improved as follows:

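import string
import time

# Path to the local text file
path = 'c:\\users\\zhangshuailing\\desktop\\walden.txt'

with open(path, 'r') as text:
    # Split on whitespace, strip punctuation from both ends, lowercase
    words = [raw_word.strip(string.punctuation).lower() for raw_word in text.read().split()]
    # The set removes duplicates and serves as the index of distinct words
    words_index = set(words)
    # Dictionary with each word as key and its number of occurrences as value
    counts_dict = {index: words.count(index) for index in words_index}

# Print the words from most frequent to least frequent
for word in sorted(counts_dict, key=lambda x: counts_dict[x], reverse=True):
    time.sleep(2)  # pause between lines so the output can be followed
    print('{}--{} times'.format(word, counts_dict[word]))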

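The improvements are:

1. Import string. After splitting the text into words, use the strip function to remove punctuation (string.punctuation) from both ends of each word, then convert the word to lowercase.

2. Convert the list to a set with the set function, which removes duplicates automatically and gives an index of the distinct words.

3. Build a dictionary with each word as the key and its number of occurrences as the value.

4. Sort with the sorted function, using reverse=True to reverse the order so that the word frequencies are listed from highest to lowest.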
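The four steps above can be tried on a short in-memory string without touching any file. This is a sketch under assumptions: the function name `word_frequencies` and the sample text are illustrative only. It also drops the empty strings that are left behind when a token consists solely of punctuation, a filtering step the listing above does not perform.

```python
import string


def word_frequencies(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Step 1: split on whitespace, strip punctuation from both ends, lowercase.
    words = [raw.strip(string.punctuation).lower() for raw in text.split()]
    # Drop tokens that were pure punctuation (assumption: these are not words).
    words = [w for w in words if w]
    # Step 2: a set removes duplicates and gives the distinct words.
    words_index = set(words)
    # Step 3: dictionary with the word as key and its count as value.
    counts_dict = {word: words.count(word) for word in words_index}
    # Step 4: sort by count, largest first.
    return sorted(counts_dict.items(), key=lambda item: item[1], reverse=True)


sample = "The pond is deep. the Pond, the pond -- deep!"
result = word_frequencies(sample)
for word, n in result:
    print('{}--{} times'.format(word, n))
```

Note that `words.count` rescans the whole list once for every distinct word, so on long texts this gets slow. `collections.Counter(words).most_common()` from the standard library is one alternative that should give the same counts in descending order with a single pass over the list.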
