Python 3: counting how often each word occurs in an English document, and sorting the result

[This article comes from 天外歸雲]

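Requirements:

1. Count the number of times each word occurs in an English document.

2. Sort the result first by count in descending order, then by the word's first letter in descending order.

3. Reading large files must be taken into account.

My solution is as follows: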

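import re

import chardet


# Generator that reads a large file in fixed-size chunks
def read_big_file(f_path, chunk_size=100):
    with open(f_path, 'rb') as f:
        while True:
            # Read at most chunk_size bytes on each pass
            chunk_data = f.read(chunk_size)
            if not chunk_data:
                break
            # Detect the chunk's encoding and yield the decoded string
            detect = chardet.detect(chunk_data)
            # print(f'File encoding: ')
            # chardet may report no encoding at all; fall back to UTF-8 then
            yield chunk_data.decode(detect["encoding"] or "utf-8")


# More Pythonic generator for reading a large file
def read_big_file_pythonic(f_path):
    with open(f_path, "rb") as f:
        # Iterate over the file object so lines are read lazily;
        # f.readlines() would load the whole file into memory first
        for line in f:
            yield line.decode()


# Split on the delimiters and count word occurrences in a dict
def words_freq(data, freq=None):
    # A mutable default ({}) would be shared between calls, so create it here
    if freq is None:
        freq = {}
    # \s is included so the newline ending each line is not glued to a word
    for word in re.split(r'[,.\s]', data):
        if word in freq:
            freq[word] += 1
        elif word != "":
            freq[word] = 1
    return freq


if __name__ == '__main__':
    f_path = "en_text.txt"
    freq = {}
    for i in read_big_file_pythonic(f_path):
        freq = words_freq(i, freq)
    # Sort by count, then by word, both descending
    print(sorted(freq.items(), key=lambda x: (x[1], x[0]), reverse=True))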

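A problem with read_big_file: reading the file by size can cut one word into two at a chunk boundary, and so far I have not found a good way to solve this.

A problem with read_big_file_pythonic: it iterates line by line, so it does not help when the large file consists of a single line.

So choose between the two methods according to the actual situation.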
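One possible way around the chunk-boundary problem is to hold back the last piece of every chunk and prepend it to the next one, so a word cut at the boundary is rejoined before it is counted. The following is only a minimal sketch under some assumptions: the names iter_words and count_words are hypothetical and not part of the solution above, the file is assumed to be UTF-8 (no chardet detection), and it is opened in text mode so that a multi-byte character is never split either.

```python
import re
from collections import Counter

# Same delimiters as above, plus any whitespace (assumption of this sketch)
DELIMS = re.compile(r"[,.\s]+")


def iter_words(f_path, chunk_size=4096, encoding="utf-8"):
    """Yield words from a file read in fixed-size chunks (hypothetical helper)."""
    tail = ""  # possibly incomplete word carried over from the previous chunk
    with open(f_path, "r", encoding=encoding) as f:
        while True:
            chunk = f.read(chunk_size)  # text mode: chunk_size counts characters
            if not chunk:
                break
            parts = DELIMS.split(tail + chunk)
            # The last piece may continue in the next chunk, so hold it back
            tail = parts.pop()
            for word in parts:
                if word:
                    yield word
    # Whatever is left after the final chunk is a complete word
    if tail:
        yield tail


def count_words(f_path, chunk_size=4096):
    """Count word occurrences without loading the whole file (hypothetical helper)."""
    return Counter(iter_words(f_path, chunk_size))
```

Because the carry-over does not depend on line breaks, this sketch also copes with a large file that contains only one line. The result can be sorted the same way as before, for example sorted(count_words("en_text.txt").items(), key=lambda x: (x[1], x[0]), reverse=True).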
