Extracting English words from a corpus and counting word frequencies, using regex splitting

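1. Using regular expressions to match and split

2. Using dict.setdefault()

3. Using the built-in function enumerate(sequence, start=0)

4. Using the built-in function sorted() with its key and reverse arguments

5. Using str.lower() to convert strings to lowercase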
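Before the full script, the five points above can be tried in isolation. The following is only a minimal sketch: the sample string and the variable names (`text`, `tokens`, `freq`, `ranked`, `numbered`) are made up for illustration and are not part of the script below.

```python
import re

# Hypothetical sample: English words mixed with Chinese, digits and punctuation.
text = u'Hello 世界, hello Python3 world! A b'

# Points 1 and 5: lowercase first, then split on every run of characters
# that are not lowercase ASCII letters; drop empty strings and single letters.
pattern = re.compile(u'[^a-z]+')
tokens = [t for t in pattern.split(text.lower()) if len(t) > 1]

# Point 2: setdefault returns the stored value, inserting 0 when the key is missing.
freq = {}
for token in tokens:
    freq[token] = freq.setdefault(token, 0) + 1

# Point 4: key picks the count, reverse=True puts the highest count first.
ranked = sorted(freq.items(), key=lambda item: item[1], reverse=True)

# Point 3: start=1 numbers the ranked words from 1 instead of 0.
numbered = [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]

for rank, word, count in numbered:
    print('%d %s %d' % (rank, word, count))
```

Note that the digit in `Python3` acts as a separator here, so the token that survives is `python`.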

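# coding: utf-8
import re
import os
import time
import codecs

path = os.path.dirname(os.path.abspath(__file__))

s = u'what a beautiful world'.lower()
pattern = re.compile(u'[^a-z]+', re.U)  # split wherever non-English characters occur
for con in pattern.split(s.lower()):  # convert all English text to lowercase
    if len(con) <= 1:  # skip empty strings and single letters such as 'a'
        continue
    else:
        print(con)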
def get_english_words():
    '''Filter out the English words mixed into the Chinese corpus and count how often each one occurs.'''
    eng_freq_dic = {}
    pattern = re.compile(u'[^a-z]+', re.U)
    cut_filename = r'e:\svn\linguistic_model\data\combine_msg_comment.txt'
    with codecs.open(cut_filename, encoding='utf-8') as f:
        for line in f:
            for con in pattern.split(line.lower()):
                if len(con) <= 1:  # skip empty strings and single letters
                    continue
                # setdefault stores the key with value 0 if it is missing,
                # then returns the stored value; add 1 for this occurrence
                count = eng_freq_dic.setdefault(con, 0) + 1
                eng_freq_dic[con] = count  # each English word and its frequency
    eng_filename = os.path.join(path, 'english_words_original.txt')
    # sort by frequency, highest first
    eng_to_write_list = sorted(eng_freq_dic.items(), key=lambda x: x[1], reverse=True)
    with codecs.open(eng_filename, mode='wb', encoding='utf-8') as f:
        # the frequency is an int, so convert it to str before writing it to the file
        f.writelines([item[0] + u'\t' + str(item[1]) + u'\n' for item in eng_to_write_list])
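def chose_top_n():
    '''Select the top 2000 words and write them to a file.'''
    line_list = []
    filename = os.path.join(path, 'english_words_original.txt')
    with codecs.open(filename, encoding='utf-8') as f:
        # enumerate(sequence, start=0) yields each element of an iterable together
        # with its position; start sets the first index and defaults to 0
        for index, line in enumerate(f, start=1):
            print('%d %s' % (index, line.strip()))
            time.sleep(1)  # pauses one second per printed line
            line_list.append(line)  # collect the line so it can be written out
            if index == 2000:
                break
    # written after the loop, so files with fewer than 2000 lines are handled too
    top_filename = os.path.join(path, 'top_2000_english_words.txt')
    with codecs.open(top_filename, mode='wb', encoding='utf-8') as f:
        f.writelines(line_list)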
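For comparison, the same counting and top-N selection can be written with `collections.Counter` from the standard library. This is a different technique from the `setdefault` approach used above, not a correction of it, and it is only a sketch: the helper names `count_english_words` and `top_n` and the sample lines are hypothetical, and no files are read or written.

```python
import re
from collections import Counter

# Runs of two or more lowercase ASCII letters. This is assumed to be equivalent
# to splitting on [^a-z]+ and then discarding empty and single-letter pieces.
WORD_RE = re.compile(u'[a-z]{2,}')


def count_english_words(lines):
    """Count English words (length >= 2) in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts


def top_n(counts, n):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return counts.most_common(n)


# Hypothetical sample lines standing in for the corpus file.
sample = [u'今天用 Python 写 code', u'python 的 re 模块 and Code review']
counts = count_english_words(sample)
top = top_n(counts, 2)
print(top)
```

`Counter.most_common(n)` replaces both the `sorted(..., reverse=True)` call and the manual cutoff at line 2000, at the cost of no longer demonstrating `setdefault` and `enumerate`.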
