Merging txt files and segmenting them with jieba

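Under the root directory, each category has its own folder. For every category folder, the script segments all of its txt files with jieba, removes stop words, and combines the result into a single txt file.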

# -*- coding: utf-8 -*-
# Requires Python 3 and the jieba package (pip install jieba).
import os
import jieba


def load_stopwords(path='stopwords.txt'):
    # The stop-word file lives in the current working directory, one word per line.
    stopwords = set()
    with open(path, 'r', encoding='utf-8') as fr:
        for line in fr:
            parts = line.strip().split()
            if parts:  # skip blank lines
                stopwords.add(parts[0])
    return stopwords


# Walk the given directory; each sub-folder is treated as one category.
def eachfile(filepath, outdir=r'd:\documents\data\redeced1'):
    stopwords_list = load_stopwords()
    for alldir in os.listdir(filepath):
        child = os.path.join(filepath, alldir)
        if not os.path.isdir(child):
            continue
        print(child)
        dat = ''
        for x in os.listdir(child):
            print(x)
            # Assumption: the input files are UTF-8; adjust the encoding if they are not.
            with open(os.path.join(child, x), 'r', encoding='utf-8') as fr:
                for y in fr:
                    for word in jieba.cut(y.strip('\n')):
                        if word not in stopwords_list:
                            dat += word + '  '
        # One merged file per category, written as GBK; characters GBK cannot encode are dropped.
        wfile = os.path.join(outdir, alldir + '.txt')
        with open(wfile, 'w', encoding='gbk', errors='ignore') as fopen:
            fopen.write(dat + '\n')


if __name__ == '__main__':
    filepath = 'd:\\documents\\data\\reduced\\'  # directory that holds the category folders
    eachfile(filepath)
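
The directory logic can be tried out without jieba or the Windows paths above. The following is a minimal sketch, not the original author's code: the names `merge_category`, `merge_tree`, and `demo` are hypothetical, the tokenizer is passed in as a parameter (here `str.split` stands in for jieba), and the output is assumed to be UTF-8 rather than GBK.

```python
import os
import tempfile


def merge_category(folder, tokenize, stopwords, encoding="utf-8"):
    """Tokenize every .txt file in `folder`, drop stop words, return one merged string."""
    kept = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(folder, name), "r", encoding=encoding) as fin:
            for line in fin:
                kept.extend(w for w in tokenize(line.strip()) if w not in stopwords)
    return " ".join(kept)


def merge_tree(root, outdir, tokenize, stopwords):
    """Write one merged <category>.txt into `outdir` for each sub-folder of `root`."""
    os.makedirs(outdir, exist_ok=True)
    written = []
    for category in sorted(os.listdir(root)):
        folder = os.path.join(root, category)
        if not os.path.isdir(folder):
            continue
        target = os.path.join(outdir, category + ".txt")
        with open(target, "w", encoding="utf-8") as fout:
            fout.write(merge_category(folder, tokenize, stopwords) + "\n")
        written.append(target)
    return written


def demo():
    """Build a tiny two-category tree in a temporary directory and merge it."""
    samples = {"sports": ["the match was good", "a good goal"],
               "tech": ["the new phone"]}
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "reduced")
        for category, texts in samples.items():
            os.makedirs(os.path.join(root, category))
            for i, text in enumerate(texts):
                path = os.path.join(root, category, f"{i}.txt")
                with open(path, "w", encoding="utf-8") as f:
                    f.write(text + "\n")
        # Whitespace splitting stands in for jieba in this dry run.
        paths = merge_tree(root, os.path.join(tmp, "merged"), str.split, {"the", "a"})
        result = {}
        for p in paths:
            with open(p, encoding="utf-8") as f:
                result[os.path.basename(p)] = f.read().strip()
        return result


if __name__ == "__main__":
    print(demo())
```

To run the same sketch on real Chinese text, `jieba.lcut` could be passed as `tokenize` and the stop-word set loaded from `stopwords.txt`. Keeping the tokenizer as a parameter makes the merge step easy to check on small English samples first.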

