Extracting text TF-IDF features with sklearn
Corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'this is the first document',
    'this is the second second document',
    'and the third one',
    'is this the first document',
]

tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(corpus)

# all distinct words in the corpus
# (in scikit-learn >= 1.0 this method is named get_feature_names_out())
print(tfidf_vec.get_feature_names())

# the id assigned to each word
print(tfidf_vec.vocabulary_)

# the vector for each sentence;
# the numbers in each vector are ordered by word id
print(tfidf_matrix.toarray())
[Output]:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]
 [0.         0.27230147 0.         0.27230147 0.         0.85322574
  0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.
  0.28847675 0.55280532 0.        ]
 [0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]]
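One detail worth noting about this output: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so every row of the matrix is a unit vector. A minimal sketch checking that against the printed first row, using only the standard library (the values are copied from the output above, so the norm matches 1.0 only up to their 8-decimal rounding):

```python
import math

# first row of tfidf_matrix.toarray() as printed above
row = [0.0, 0.43877674, 0.54197657, 0.43877674, 0.0, 0.0,
       0.35872874, 0.0, 0.43877674]

# with the default norm='l2', every row is scaled to unit length
l2_norm = math.sqrt(sum(x * x for x in row))
print(l2_norm)  # ≈ 1.0
```

Passing `norm=None` to `TfidfVectorizer` disables this scaling and leaves raw tf-idf products in the matrix.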
Computing TF-IDF features in plain Python
corpus = [
    'this is the first document',
    'this is the second second document',
    'and the third one',
    'is this the first document',
]

# tokenize the corpus
word_list = []
for i in range(len(corpus)):
    word_list.append(corpus[i].split(' '))
print(word_list)
[Output]:
[['this', 'is', 'the', 'first', 'document'],
 ['this', 'is', 'the', 'second', 'second', 'document'],
 ['and', 'the', 'third', 'one'],
 ['is', 'this', 'the', 'first', 'document']]
# count word frequencies
from collections import Counter

countlist = []
for i in range(len(word_list)):
    count = Counter(word_list[i])
    countlist.append(count)
countlist

[Output]:
[Counter({'this': 1, 'is': 1, 'the': 1, 'first': 1, 'document': 1}),
 Counter({'second': 2, 'this': 1, 'is': 1, 'the': 1, 'document': 1}),
 Counter({'and': 1, 'the': 1, 'third': 1, 'one': 1}),
 Counter({'is': 1, 'this': 1, 'the': 1, 'first': 1, 'document': 1})]
# define the functions that compute the tf-idf formula
# word is looked up in count, and count comes from countlist
# count[word] is the frequency of the word; sum(count.values()) is the
# total number of words in the sentence
def tf(word, count):
    return count[word] / sum(count.values())

# number of sentences that contain the word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# len(count_list) is the total number of sentences and
# n_containing(word, count_list) is the number of sentences containing
# the word; the +1 keeps the denominator from being zero
def idf(word, count_list):
    return math.log(len(count_list) / (1 + n_containing(word, count_list)))

# multiply tf and idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)
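As a sanity check on these formulas, the value they produce for 'first' in document 1 (0.05754) can be reproduced by hand. This sketch works through that arithmetic for the same four-sentence corpus:

```python
import math

# 'first' appears once in document 1, which has 5 words
tf_first = 1 / 5                       # 0.2

# 'first' occurs in 2 of the 4 sentences; the formula adds 1 to the denominator
idf_first = math.log(4 / (1 + 2))      # ln(4/3) ≈ 0.28768

print(round(tf_first * idf_first, 5))  # → 0.05754
```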
# compute the tf-idf value of every word
import math

for i, count in enumerate(countlist):
    print("top words in document {}".format(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:]:
        print("\tword: {}, tf-idf: {}".format(word, round(score, 5)))

[Output]:
top words in document 1
	word: first, tf-idf: 0.05754
	word: this, tf-idf: 0.0
	word: is, tf-idf: 0.0
	word: document, tf-idf: 0.0
	word: the, tf-idf: -0.04463
top words in document 2
	word: second, tf-idf: 0.23105
	word: this, tf-idf: 0.0
	word: is, tf-idf: 0.0
	word: document, tf-idf: 0.0
	word: the, tf-idf: -0.03719
top words in document 3
	word: and, tf-idf: 0.17329
	word: third, tf-idf: 0.17329
	word: one, tf-idf: 0.17329
	word: the, tf-idf: -0.05579
top words in document 4
	word: first, tf-idf: 0.05754
	word: is, tf-idf: 0.0
	word: this, tf-idf: 0.0
	word: document, tf-idf: 0.0
	word: the, tf-idf: -0.04463
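Note that this hand-rolled idf can go negative: the +1 sits only in the denominator, so a word that appears in every sentence (like 'the') gets idf = ln(4/5) < 0, which is why 'the' carries a negative tf-idf in every document here. scikit-learn's `TfidfVectorizer` avoids this with its default smoothing (`smooth_idf=True`), which computes idf = ln((1+N)/(1+df)) + 1 and is therefore never below 1. A small sketch contrasting the two for 'the' (N = 4 sentences, df = 4):

```python
import math

N, df = 4, 4  # 'the' appears in all 4 sentences

# the formula used in this post: can be negative
idf_manual = math.log(N / (1 + df))
print(round(idf_manual, 5))  # → -0.22314

# scikit-learn's smoothed variant: never below 1
idf_sklearn = math.log((1 + N) / (1 + df)) + 1
print(idf_sklearn)           # → 1.0
```

This difference (plus the L2 row normalization shown earlier) is why the numbers from the manual implementation do not match the `TfidfVectorizer` matrix, even though both are legitimate tf-idf weightings.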