Machine Learning: What Is TF-IDF?

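In information retrieval and text mining, one often encounters a term's tf-idf (term frequency - inverse document frequency). The magnitude of this value reflects how important the term is to one particular document within a collection of texts.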

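tf-idf is a statistical method for evaluating how important a word is to one document in a document collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, and decreases in inverse proportion to how often it appears across the corpus. Various forms of tf-idf weighting are commonly used by search engines as a measure or ranking of the relevance between a document and a user query. Besides tf-idf, internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in the search results.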
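The relationship described above is commonly written as follows. This is one standard textbook form, and the notation is chosen here for illustration: f_{t,d} is the number of times term t appears in document d, N is the number of documents in the corpus D, and n_t is the number of documents that contain t. Libraries often use smoothed or normalized variants instead.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t, D) = \log \frac{N}{n_t}, \qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```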

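As an example, take a short article of 100 words in which the word "cat" (「貓」) appears 3 times. The term frequency of "cat" in this article is then tf = 3 / 100 = 0.03.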

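Suppose the corpus contains 10,000,000 articles, and only 1,000 of them contain the word "cat". The inverse document frequency of "cat" over all the texts, that is, over the whole corpus, is then idf = log(10,000,000 / 1,000) = 4.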

Here log is taken to base 10. From this we can compute the tf-idf of "cat" in this article: tf-idf = 0.03 × 4 = 0.12.
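The arithmetic for the "cat" example can be checked with a short sketch. The helper names tf and idf are illustrative, not part of any library, and the definitions assume the plain count / length and base-10 log(N / df) forms used in the example.

```python
import math


def tf(term_count: int, total_terms: int) -> float:
    """Term frequency: the share of the document's words that are the term."""
    return term_count / total_terms


def idf(num_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(num_docs / docs_with_term)


# "cat" appears 3 times in a 100-word article,
# and in 1,000 of the 10,000,000 articles in the corpus.
cat_tf = tf(3, 100)
cat_idf = idf(10_000_000, 1_000)
cat_tfidf = cat_tf * cat_idf

print(cat_tf, cat_idf, round(cat_tfidf, 4))
```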

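Now suppose that in the same article the word "is" (「是」) appears 20 times, so its term frequency is 20 / 100 = 0.2. Judged by term frequency alone, "is" would clearly be more important to this article than "cat".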

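But we also have the inverse document frequency. Suppose, for the sake of the example, that "is" appears in all 10,000,000 articles. The inverse document frequency of "is" is then idf = log(10,000,000 / 10,000,000) = log(1) = 0.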

Putting the two together, the tf-idf of "is" is only 0.2 × 0 = 0, far below that of "cat".

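So tf-idf tells us that, for this article, "cat" is far more important than "is", even though "is" appears more often. Many other words likewise appear very frequently without carrying any information that characterizes the article, for example the common Chinese function words 「這」 (this), 「也」 (also), 「就」 (then), 「是」 (is), 「的」 (of), and 「了」 (a completion particle).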
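The comparison between the two words can be sketched in a few lines. The function name tf_idf and the stats dictionary are illustrative assumptions; the numbers are the ones from the example, including the assumption that "is" appears in every article.

```python
import math


def tf_idf(term_count, total_terms, num_docs, docs_with_term):
    """tf-idf with tf = count / length and idf = log10(N / df)."""
    return (term_count / total_terms) * math.log10(num_docs / docs_with_term)


NUM_DOCS = 10_000_000
DOC_LENGTH = 100

# word -> (occurrences in the article, number of articles containing the word)
stats = {"cat": (3, 1_000), "is": (20, 10_000_000)}

scores = {
    word: tf_idf(count, DOC_LENGTH, NUM_DOCS, df)
    for word, (count, df) in stats.items()
}

# Rank the words from most to least important for this article.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranked)
```

Although "is" has the higher raw frequency, its idf of 0 cancels it out, so "cat" ranks first.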

That covers the explanation of tf-idf. The code below tries it out with scikit-learn.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np


def test_countvectorizer():
    # 1. Define a small training dataset.
    train_docs = [
        "don't call you tonight",
        "call me isn't a cab",
        'please call me please!',
    ]
    # 2. Create a CountVectorizer object and learn the vocabulary.
    cv = CountVectorizer()
    cv.fit(train_docs)
    # Print the vocabulary of train_docs (token -> column index).
    print(cv.vocabulary_)
    # 3. Transform the training data into a document-term matrix
    #    (a sparse matrix) with transform().
    train_data = cv.transform(train_docs)
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is its replacement.
    print(cv.get_feature_names_out().tolist())
    # Each sparse entry prints as (document index, vocabulary index) count.
    print(train_data)
    train_data = train_data.toarray()
    print(train_data)
    # 4. Transform the test data into a document-term matrix
    #    (using the existing vocabulary).
    test_docs = ["please don't call me"]
    test_data = cv.transform(test_docs).toarray()
    # 5. Examine the vocabulary and the document-term matrix together.
    print(test_data)


def test_tfidf_filter_vec():
    train_docs = [
        'call you tonight',
        'call me a cab',
        'please call me... please!',
    ]
    cv = TfidfVectorizer()
    cv.fit(train_docs)
    print(cv.vocabulary_)
    train_data = cv.transform(train_docs)
    print(cv.get_feature_names_out().tolist())
    # Each sparse entry prints as (document index, vocabulary index) weight.
    print(train_data)
    train_data = train_data.toarray()
    print(train_data)


if __name__ == '__main__':
    # test_countvectorizer()
    test_tfidf_filter_vec()

Output of test_tfidf_filter_vec (feature names, sparse matrix, dense matrix):

['cab', 'call', 'me', 'please', 'tonight', 'you']
  (0, 5)    0.652490884512534
  (0, 4)    0.652490884512534
  (0, 1)    0.3853716274664007
  (1, 2)    0.5478321549274363
  (1, 1)    0.4254405389711991
  (1, 0)    0.7203334490549893
  (2, 3)    0.901008145286396
  (2, 2)    0.3426199591918006
  (2, 1)    0.2660749625405929
[[0.         0.38537163 0.         0.         0.65249088 0.65249088]
 [0.72033345 0.42544054 0.54783215 0.         0.         0.        ]
 [0.         0.26607496 0.34261996 0.90100815 0.         0.        ]]
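These weights differ from what the plain tf × log10(N / df) formula would give, because TfidfVectorizer's default settings use a smoothed natural-log idf, ln((1 + n) / (1 + df)) + 1, multiply it by the raw term count, and then scale each row to unit L2 length. The sketch below reproduces the numbers above in pure Python under those assumptions. The regular expression is meant to mimic the default tokenizer (lowercased tokens of two or more word characters), and the variable names are illustrative.

```python
import math
import re
from collections import Counter

docs = ["call you tonight", "call me a cab", "please call me... please!"]

# Assumed default tokenization: lowercase, tokens of 2+ word characters,
# so the single-letter word "a" is dropped.
TOKEN_RE = re.compile(r"\b\w\w+\b")
tokenized = [TOKEN_RE.findall(doc.lower()) for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
df = Counter(tok for toks in tokenized for tok in set(toks))
# Smoothed idf: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

rows = []
for toks in tokenized:
    counts = Counter(toks)
    raw = [counts[t] * idf[t] for t in vocab]   # raw count * idf
    norm = math.sqrt(sum(v * v for v in raw))   # L2 norm of the row
    rows.append([v / norm for v in raw])

print(vocab)
for row in rows:
    print([round(v, 8) for v in row])
```

Note that "call", which occurs in every document, still gets a non-zero weight here: the "+ 1" in the smoothed idf keeps such terms from being zeroed out entirely, unlike in the "is" example.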

Output of test_countvectorizer (feature names, sparse matrix, dense training matrix, dense test matrix):

['cab', 'call', 'don', 'isn', 'me', 'please', 'tonight', 'you']
  (0, 1)    1
  (0, 2)    1
  (0, 6)    1
  (0, 7)    1
  (1, 0)    1
  (1, 1)    1
  (1, 3)    1
  (1, 4)    1
  (2, 1)    1
  (2, 4)    1
  (2, 5)    2
[[0 1 1 0 0 0 1 1]
 [1 1 0 1 1 0 0 0]
 [0 1 0 0 1 2 0 0]]
[[0 1 1 0 1 1 0 0]]
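The vocabulary contains "don" and "isn" rather than "don't" and "isn't", and has no entry for "a". A minimal sketch of why, assuming the default token pattern keeps only runs of two or more word characters: the apostrophe splits "don't" into "don" and "t", and single-character tokens are discarded. The helpers tokenize and count_vector are illustrative, not scikit-learn functions.

```python
import re
from collections import Counter

# Assumed equivalent of the default token pattern: 2+ word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_RE.findall(text.lower())


train_docs = [
    "don't call you tonight",
    "call me isn't a cab",
    "please call me please!",
]

# Sorted vocabulary learned from the training documents.
vocab = sorted({tok for doc in train_docs for tok in tokenize(doc)})


def count_vector(text):
    """Count each vocabulary term in the text; unknown words are ignored."""
    counts = Counter(tokenize(text))
    return [counts[t] for t in vocab]


print(vocab)
print([count_vector(doc) for doc in train_docs])
print(count_vector("please don't call me"))
```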
