Python 3: Extracting Keywords from Article Titles

2021-08-31 16:10:15


For sklearn details, see "Text feature extraction: tf-idf term weighting".

import os
import sys

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer


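# Generic words that are common in paper titles and carry little meaning as keywords.
stop_words = set((
    "基於", "面向", "研究", "系統", "設計", "綜述", "應用", "進展", "技術", "框架", "txt"
))


def getfilelist(path):
    # Collect the file names under path, skipping hidden files.
    filelist = []
    files = os.listdir(path)
    for f in files:
        if f[0] == '.':
            continue
        filelist.append(f)
    return filelist, path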

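def fenci(filename, path, segpath):
    # Create the folder that stores the segmentation results.
    if not os.path.exists(segpath):
        os.mkdir(segpath)
    # The file name is the article title, so the name itself is segmented.
    seg_list = jieba.cut(filename)
    result = []
    for seg in seg_list:
        seg = ''.join(seg.split())
        # Keep tokens of at least two characters that are not stop words.
        if len(seg.strip()) >= 2 and seg.lower() not in stop_words:
            result.append(seg)
    # Join the tokens with spaces and save them locally.
    with open(segpath + "/" + filename + "-seg.txt", "w+", encoding="utf-8") as f:
        f.write(' '.join(result))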


def tfidf(filelist, sfilepath, path, tfidfw):
    # Read each segmented title back in as one document of the corpus.
    corpus = []
    for ff in filelist:
        fname = path + ff
        with open(fname + "-seg.txt", 'r', encoding="utf-8") as f:
            content = f.read()
        corpus.append(content)

    vectorizer = TfidfVectorizer()  # this class handles both vectorization and TF-IDF weighting
    tfidf = vectorizer.fit_transform(corpus)
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
    word = vectorizer.get_feature_names_out()
    weight = tfidf.toarray()

    if not os.path.exists(sfilepath):
        os.mkdir(sfilepath)

    for i in range(len(weight)):
        print('----------writing all the tf-idf in the ', i, 'file into ', sfilepath + '/', i, ".txt----------")
        # Keep only the terms whose weight reaches the threshold tfidfw.
        result = {}
        for j in range(len(word)):
            if weight[i][j] >= tfidfw:
                result[word[j]] = weight[i][j]
        # Sort by weight, highest first.
        resultsort = sorted(result.items(), key=lambda item: item[1], reverse=True)
        with open(sfilepath + "/" + str(i) + ".txt", 'w+', encoding="utf-8") as f:
            for z in range(len(resultsort)):
                f.write(resultsort[z][0] + " " + str(resultsort[z][1]) + '\n')
                print(resultsort[z][0] + " " + str(resultsort[z][1]))
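The filtering rule used in fenci can be tried on its own, without jieba installed. The sketch below applies the same two conditions (at least two characters, not in stop_words) to a hand-written token list. That list is only an assumed stand-in for what jieba.cut might return for the second sample title, and filter_tokens is a hypothetical helper name that does not appear in the original script.

```python
# Stop words copied from the script above.
stop_words = {
    "基於", "面向", "研究", "系統", "設計", "綜述", "應用", "進展", "技術", "框架", "txt"
}


def filter_tokens(tokens):
    """Apply the fenci filter: strip whitespace, drop short tokens and stop words."""
    kept = []
    for seg in tokens:
        seg = "".join(seg.split())
        if len(seg) >= 2 and seg.lower() not in stop_words:
            kept.append(seg)
    return kept


# Assumed segmentation of "基於hadoop的分布式並行增量爬蟲技術研究.txt".
# Real jieba output depends on its dictionary and on any user dictionary loaded.
tokens = ["基於", "hadoop", "的", "分布式", "並行增量", "爬蟲", "技術", "研究", ".", "txt"]

filtered = filter_tokens(tokens)
print(" ".join(filtered))
```

Under that assumed segmentation, the four tokens that survive are the same four keywords listed for file 1 in the sample output further down. The single-character tokens are removed by the length check, the rest by the stop-word set.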


The TfidfVectorizer() class implements both word vectorization and TF-IDF weight computation.

Sample output for six titles:

using jieba on 農業大資料研究與應用進展綜述.txt

using jieba on 基於hadoop的分布式並行增量爬蟲技術研究.txt

using jieba on 基於rpa的財務共享服務中心賬表核對流程優化.txt

using jieba on 基於大資料的特徵趨勢統計系統設計.txt

using jieba on 網路大資料平台異常風險監測系統設計.txt

using jieba on 面向資料中心的多源異構資料統一訪問框架.txt

----------writing all the tf-idf in the  0 file into  ./keywords/ 0 .txt----------

農業 0.773262366783

大資料 0.634086202434

----------writing all the tf-idf in the  1 file into  ./keywords/ 1 .txt----------

hadoop 0.5

分布式 0.5

並行增量 0.5

爬蟲 0.5

----------writing all the tf-idf in the  2 file into  ./keywords/ 2 .txt----------

rpa 0.408248290464

優化 0.408248290464

服務中心 0.408248290464

流程 0.408248290464

財務共享 0.408248290464

賬表核對 0.408248290464

----------writing all the tf-idf in the  3 file into  ./keywords/ 3 .txt----------

特徵 0.521823488025

統計 0.521823488025

趨勢 0.521823488025

大資料 0.427902724969

----------writing all the tf-idf in the  4 file into  ./keywords/ 4 .txt----------

大資料平台 0.4472135955

異常 0.4472135955

監測 0.4472135955

網路 0.4472135955

風險 0.4472135955

----------writing all the tf-idf in the  5 file into  ./keywords/ 5 .txt----------

多源異構資料 0.57735026919

資料中心 0.57735026919

統一訪問 0.57735026919
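The weights above can be checked by hand. The sketch below is a plain-Python recomputation, assuming TfidfVectorizer ran with its default settings (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row). The token lists are read off the output above, on the assumption that the printed keywords are the complete token set of each title. tfidf_rows is a hypothetical helper name, not part of the original script.

```python
import math

# Token lists read off the sample output (one list per title, in file order).
docs = [
    ["農業", "大資料"],
    ["hadoop", "分布式", "並行增量", "爬蟲"],
    ["rpa", "優化", "服務中心", "流程", "財務共享", "賬表核對"],
    ["特徵", "統計", "趨勢", "大資料"],
    ["大資料平台", "異常", "監測", "網路", "風險"],
    ["多源異構資料", "資料中心", "統一訪問"],
]


def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2 row normalization."""
    n = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    rows = []
    for doc in docs:
        # Term count times smoothed IDF, accumulated one occurrence at a time.
        raw = {}
        for term in doc:
            idf = math.log((1 + n) / (1 + df[term])) + 1
            raw[term] = raw.get(term, 0.0) + idf
        # Scale the row to unit Euclidean length.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({term: v / norm for term, v in raw.items()})
    return rows


rows = tfidf_rows(docs)
for term, w in rows[0].items():
    print(term, round(w, 6))
```

Two patterns in the output follow from this. When every term of a title occurs in no other title, all of its terms share the same IDF, so each weight is 1/sqrt(k) for k terms: 0.5 for four terms, about 0.447 for five, about 0.408 for six, and about 0.577 for three. 大資料 occurs in two titles (files 0 and 3), so its IDF is lower and it scores below the other terms of those titles.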


