For jieba word segmentation, see: the jieba GitHub repository
For sklearn, see: Text feature extraction, section 4.2.3.4, tf-idf term weighting
import os
import jieba
import sys
from sklearn.feature_extraction.text import tfidfvectorizer
jieba.load_userdict('userdicttest.txt')
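For reference, jieba's user dictionary is a plain-text file with one entry per line: the word, optionally followed by a frequency and a part-of-speech tag. Judging from the multi-character tokens in the output below, userdicttest.txt presumably contains entries along these lines (the exact contents and frequencies are assumptions):

大資料 5 n
大資料平台 5 n
並行增量 5 n
賬表核對 5 n
多源異構資料 5 n
統一訪問 5 n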
stop_words = set((
    "基於", "面向", "研究", "系統", "設計", "綜述", "應用", "進展", "技術", "框架", "txt"
))

def getfilelist(path):
    # Collect the file names under path, skipping hidden files
    filelist = []
    files = os.listdir(path)
    for f in files:
        if f[0] == '.':
            pass
        else:
            filelist.append(f)
    return filelist, path
def fenci(filename, path, segpath):
    # Create the folder that stores the segmentation results
    if not os.path.exists(segpath):
        os.mkdir(segpath)
    # Segment the file name itself, i.e. the paper title
    seg_list = jieba.cut(filename)
    result = []
    for seg in seg_list:
        seg = ''.join(seg.split())
        # Keep tokens of two or more characters that are not stop words
        if len(seg.strip()) >= 2 and seg.lower() not in stop_words:
            result.append(seg)
    # Join the tokens with spaces and save them locally
    f = open(segpath + "/" + filename + "-seg.txt", "w+")
    f.write(' '.join(result))
    f.close()
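As a concrete illustration, calling fenci on one of the titles processed below leaves behind a space-separated token file (a hypothetical call; the resulting tokens follow from the stop-word filter above and match the tf-idf output later):

fenci('基於hadoop的分布式並行增量爬蟲技術研究.txt', './data/', './segfile')
# ./segfile/基於hadoop的分布式並行增量爬蟲技術研究.txt-seg.txt now contains:
# hadoop 分布式 並行增量 爬蟲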
def tfidf(filelist, sfilepath, path, tfidfw):
    corpus = []
    # Read every segmented file into the corpus, one document per file
    for ff in filelist:
        fname = path + ff
        f = open(fname + "-seg.txt", 'r+')
        content = f.read()
        f.close()
        corpus.append(content)
    vectorizer = TfidfVectorizer()  # this class performs word vectorization and tf-idf weighting
    tfidf = vectorizer.fit_transform(corpus)
    word = vectorizer.get_feature_names()  # renamed to get_feature_names_out() in sklearn >= 1.0
    weight = tfidf.toarray()
    if not os.path.exists(sfilepath):
        os.mkdir(sfilepath)
    for i in range(len(weight)):
        print('----------writing all the tf-idf in the ', i, 'file into ', sfilepath + '/', i, ".txt----------")
        f = open(sfilepath + "/" + str(i) + ".txt", 'w+')
        result = {}
        # Keep only the words whose weight reaches the tfidfw threshold
        for j in range(len(word)):
            if weight[i][j] >= tfidfw:
                result[word[j]] = weight[i][j]
        # Sort by weight in descending order and write "word weight" lines
        resultsort = sorted(result.items(), key=lambda item: item[1], reverse=True)
        for z in range(len(resultsort)):
            f.write(resultsort[z][0] + " " + str(resultsort[z][1]) + '\r\n')
            print(resultsort[z][0] + " " + str(resultsort[z][1]))
        f.close()
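The log and keyword files below come from a small driver that wires the three functions together. A minimal sketch (the './data/' source directory, the './segfile' folder, and the 0.3 threshold are assumptions; only the './keywords' output folder is visible in the log):

if __name__ == '__main__':
    filelist, path = getfilelist('./data/')
    for filename in filelist:
        print('using jieba on', filename)
        fenci(filename, path, './segfile')
    tfidf(filelist, './keywords', './segfile/', 0.3)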
The TfidfVectorizer class implements word vectorization and the tf-idf weight computation.
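A standalone sanity check of the class, independent of the pipeline above (a minimal sketch; the two toy documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['hadoop 分布式 爬蟲', '分布式 監測']  # one space-separated document per string
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)      # sparse document-term matrix
print(vectorizer.get_feature_names())         # the vocabulary, sorted alphabetically
print(tfidf.toarray())                        # one L2-normalized row of tf-idf weights per document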
using jieba on 農業大資料研究與應用進展綜述.txt
using jieba on 基於hadoop的分布式並行增量爬蟲技術研究.txt
using jieba on 基於rpa的財務共享服務中心賬表核對流程優化.txt
using jieba on 基於大資料的特徵趨勢統計系統設計.txt
using jieba on 網路大資料平台異常風險監測系統設計.txt
using jieba on 面向資料中心的多源異構資料統一訪問框架.txt
----------writing all the tf-idf in the 0 file into ./keywords/ 0 .txt----------
農業 0.773262366783
大資料 0.634086202434
----------writing all the tf-idf in the 1 file into ./keywords/ 1 .txt----------
hadoop 0.5
分布式 0.5
並行增量 0.5
爬蟲 0.5
----------writing all the tf-idf in the 2 file into ./keywords/ 2 .txt----------
rpa 0.408248290464
優化 0.408248290464
服務中心 0.408248290464
流程 0.408248290464
財務共享 0.408248290464
賬表核對 0.408248290464
----------writing all the tf-idf in the 3 file into ./keywords/ 3 .txt----------
特徵 0.521823488025
統計 0.521823488025
趨勢 0.521823488025
大資料 0.427902724969
----------writing all the tf-idf in the 4 file into ./keywords/ 4 .txt----------
大資料平台 0.4472135955
異常 0.4472135955
監測 0.4472135955
網路 0.4472135955
風險 0.4472135955
----------writing all the tf-idf in the 5 file into ./keywords/ 5 .txt----------
多源異構資料 0.57735026919
資料中心 0.57735026919
統一訪問 0.57735026919
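A note on the numbers above: each term that occurs in exactly one title has the same tf and idf, and TfidfVectorizer L2-normalizes every document row by default, so a title with n such unique terms gives each of them weight 1/√n: 1/√4 = 0.5 (file 1), 1/√5 ≈ 0.4472135955 (file 4), 1/√6 ≈ 0.408248290464 (file 2), and 1/√3 ≈ 0.57735026919 (file 5). 大資料 appears in two titles (files 0 and 3), so its idf, and hence its weight, is lower than that of the terms unique to those titles.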