TF-IDF: Explanation and Applications

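tf: the frequency with which a word w appears in a document d, written tf(w, d). The larger the value, the more important the word is within that document.

idf: a measure of how common a word is across the whole collection. A small value means the word appears in most documents; a large value means the word rarely appears in other documents, so it is useful for telling documents apart (for example, in classification).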
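To make the two definitions concrete, here is a minimal pure-Python sketch. It assumes one common variant: tf as a relative frequency within the document and idf as log(N / df). The toy corpus and the helper names `tf` and `idf` are made up for illustration; libraries such as scikit-learn use slightly different formulas.

```python
import math
from collections import Counter

# Toy corpus, chosen only for illustration (tokens are whitespace-separated).
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "the bird flew away",
]
tokenized = [d.split() for d in docs]


def tf(word, tokens):
    """Relative frequency of `word` within one document."""
    return Counter(tokens)[word] / len(tokens)


def idf(word, all_docs):
    """log(N / df): small for words found in many documents, large for rare ones."""
    df = sum(1 for tokens in all_docs if word in tokens)
    # Assumes the word occurs in at least one document (df > 0).
    return math.log(len(all_docs) / df)


print(tf("the", tokenized[0]))  # "the" is 2 of the 6 tokens in the first document
print(idf("the", tokenized))    # occurs in every document, so log(1) = 0
print(idf("cat", tokenized))    # occurs in one of three documents, so log(3)
```

A word like "the" gets a high tf but an idf of zero, which is exactly why the two quantities are multiplied together.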

Applications:

(1) Search engines

tf-idf(q, d) = sum over each word w in the query q of tf(w, d) * idf(w)
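A rough sketch of how such a score could rank documents for a query, assuming the score of a document d is the sum of tf(w, d) * idf(w) over the query words w. The mini-collection, the raw-count tf, and the function names `idf` and `score` are assumptions made for this example, not part of any real search engine.

```python
import math
from collections import Counter

# Hypothetical mini-collection; real engines index far more text.
docs = {
    "d1": "python tutorial for beginners",
    "d2": "advanced python python tricks",
    "d3": "cooking tutorial",
}
tokens = {name: text.split() for name, text in docs.items()}
n_docs = len(tokens)


def idf(word):
    df = sum(1 for t in tokens.values() if word in t)
    # Words absent from every document contribute nothing to the score.
    return math.log(n_docs / df) if df else 0.0


def score(query, name):
    """Sum tf(w, d) * idf(w) over the words w of the query (tf = raw count)."""
    counts = Counter(tokens[name])
    return sum(counts[w] * idf(w) for w in query.split())


query = "python tricks"
ranking = sorted(tokens, key=lambda name: score(query, name), reverse=True)
print(ranking)  # d2 mentions both query words, d1 only one, d3 none
```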

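(3) Finding similar articles

Build a term-frequency vector for each of the two articles;

Compute the cosine similarity of the two vectors; the larger the value, the more similar the articles are.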
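The two steps above can be sketched in a few lines of plain Python. The sample sentences and the helper names `term_vector` and `cosine_similarity` are invented for illustration, and raw counts stand in for the term-frequency vectors; tf-idf weights could be substituted the same way.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


article_a = "i like watching tv but not movies"
article_b = "i do not like watching tv or movies"
# Both vectors must share one vocabulary so that positions line up.
vocabulary = sorted(set(article_a.split()) | set(article_b.split()))

vec_a = term_vector(article_a, vocabulary)
vec_b = term_vector(article_b, vocabulary)
similarity = cosine_similarity(vec_a, vec_b)
print(similarity)
```

Because term counts are never negative, the result lies between 0 (no shared words) and 1 (same word proportions).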

tf-idf_demo

# encoding: utf-8
# @time : 2017/8/18 1:10
# @author : jackniu
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Step 1: count raw term occurrences per document
vectorizer = CountVectorizer()
corpus = [
    'this is the first document this.',
    'this is the second second document.',
    'and the third one.',
    'is this the first document?',
]
x = vectorizer.fit_transform(corpus)
print("Vocabulary")
print(vectorizer.vocabulary_)
# Columns follow the alphabetically sorted vocabulary: and, document, first, ...
print("tf part of tf_idf: ")
print(x.toarray())

# Step 2: turn the counts into tf-idf weights (idf without smoothing)
print("Computing idf")
transformer = TfidfTransformer(smooth_idf=False)
tfidf = transformer.fit_transform(x)
print(tfidf.toarray())
print(transformer.idf_)

# Alternative: TfidfVectorizer does counting and weighting in one step
# (it uses smooth_idf=True by default, so the numbers differ slightly)
print("Computing directly with tfidf")
vect = TfidfVectorizer()
y = vect.fit_transform(corpus)
print(y.toarray())
print(vect.idf_)

Run output

d:\software\python\python35\python.exe d:/pycharmprojects/mlprogram/ml_program/chapter2/tf_idf_demo.py
Vocabulary
tf part of tf_idf: 
[[0 1 1 1 0 0 1 0 2]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
Computing idf
[[ 0.          0.34643788  0.45552418  0.34643788  0.          0.
   0.26903992  0.          0.69287577]
 [ 0.          0.24014568  0.          0.24014568  0.          0.89006176
   0.18649454  0.          0.24014568]
 [ 0.56115953  0.          0.          0.          0.56115953  0.
   0.23515939  0.56115953  0.        ]
 [ 0.          0.43306685  0.56943086  0.43306685  0.          0.
   0.33631504  0.          0.43306685]]
[ 2.38629436  1.28768207  1.69314718  1.28768207  2.38629436  2.38629436
  1.          2.38629436  1.28768207]
Computing directly with tfidf
[[ 0.          0.34934021  0.43150466  0.34934021  0.          0.
   0.28560851  0.          0.69868042]
 [ 0.          0.27230147  0.          0.27230147  0.          0.85322574
   0.22262429  0.          0.27230147]
 [ 0.55280532  0.          0.          0.          0.55280532  0.
   0.28847675  0.55280532  0.        ]
 [ 0.          0.43877674  0.54197657  0.43877674  0.          0.
   0.35872874  0.          0.43877674]]
[ 1.91629073  1.22314355  1.51082562  1.22314355  1.91629073  1.91629073
  1.          1.91629073  1.22314355]
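The two idf arrays above differ because of smoothing. As a cross-check, the sketch below recomputes them by hand, using the formulas scikit-learn documents for TfidfTransformer: ln(n / df) + 1 when smooth_idf=False, and ln((1 + n) / (1 + df)) + 1 when smooth_idf=True (the default), followed by L2 normalization of each row. The document frequencies are read off the count matrix in the output; the variable names are chosen for this example only.

```python
import math

n_docs = 4
# Document frequencies read off the count matrix, in vocabulary order:
# and, document, first, is, one, second, the, third, this
doc_freq = [1, 3, 2, 3, 1, 1, 4, 1, 3]

# smooth_idf=False: idf = ln(n / df) + 1
idf_plain = [math.log(n_docs / df) + 1 for df in doc_freq]
# smooth_idf=True (the default): idf = ln((1 + n) / (1 + df)) + 1
idf_smooth = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

print([round(v, 8) for v in idf_plain])
print([round(v, 8) for v in idf_smooth])

# First row of the tf-idf matrix: multiply counts by idf, then L2-normalize.
row = [0, 1, 1, 1, 0, 0, 1, 0, 2]
weighted = [count * weight for count, weight in zip(row, idf_plain)]
norm = math.sqrt(sum(v * v for v in weighted))
first_row = [v / norm for v in weighted]
print([round(v, 8) for v in first_row])
```

Note that "the" occurs in all four documents, so its idf is exactly 1 under both formulas instead of 0; the added 1 keeps such words from being dropped entirely.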

To carry out the applications above, use the idf values learned by the fitted tfidf model in the computation.
