Feature Engineering: Extracting Features with TF-IDF

This article introduces TF-IDF, a commonly used and effective feature extraction method for text processing.

TF, or term frequency, is one way to measure the information content of text. Put simply, it counts how often each word occurs in a given document.

def computetf(worddict, bow):
    # worddict maps each word to its count; bow is the document's list of tokens
    tfdict = {}
    bowcount = len(bow)
    for word, count in worddict.items():
        tfdict[word] = count / float(bowcount)
    return tfdict
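As a usage sketch, the two inputs of computetf can be built with collections.Counter. The whitespace tokenization and the sample sentence are assumptions made for illustration, and computetf is repeated in compact form so that the snippet runs on its own.

```python
from collections import Counter

def computetf(worddict, bow):
    """Term frequency: count of each word divided by the document length."""
    bowcount = len(bow)
    return {word: count / float(bowcount) for word, count in worddict.items()}

# Assumed tokenization: split the sentence on whitespace.
bow = "i like you but just like you".split()
worddict = Counter(bow)      # maps each word to its raw count

tf = computetf(worddict, bow)
print(tf["like"])            # 'like' is 2 of the 7 tokens, about 0.2857
```

Because every token is counted, the term frequencies of one document sum to 1.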

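IDF, or inverse document frequency, measures how important a word is across the whole collection. It is usually computed by dividing the total number of documents by the number of documents that contain the word, then taking the logarithm of the result.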

import math

def computeidf(doclist):
    # doclist is a list of {word: count} dicts, one per document
    n = len(doclist)
    idfdict = {}
    for doc in doclist:
        for word, val in doc.items():
            if val > 0:
                # count the documents in which the word actually occurs
                idfdict[word] = idfdict.get(word, 0) + 1
    # turn each document count into log10(total documents / document count)
    for word, val in idfdict.items():
        idfdict[word] = math.log10(n / float(val))
    return idfdict
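A small example may make the effect of IDF clearer. The sketch below uses a compact variant of computeidf and three made-up word-count dictionaries (assumptions for illustration only): a word that occurs in every document gets an IDF of 0, while a word found in only one document gets the largest value.

```python
import math

def computeidf(doclist):
    """IDF per word: log10(total documents / documents containing the word)."""
    n = len(doclist)
    docfreq = {}
    for doc in doclist:
        for word, val in doc.items():
            if val > 0:
                docfreq[word] = docfreq.get(word, 0) + 1
    return {word: math.log10(n / float(df)) for word, df in docfreq.items()}

# Hypothetical word counts for three tiny documents.
docs = [
    {"love": 1, "you": 1},
    {"hate": 1, "you": 1},
    {"like": 2, "you": 2},
]

idf = computeidf(docs)
print(idf["you"])    # occurs in all 3 documents, so log10(3 / 3) = 0.0
print(idf["love"])   # occurs in 1 of 3 documents, so log10(3), about 0.477
```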

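TF-IDF is simply the product tf * idf, and it measures how much information a word carries across the whole set of documents. Suppose document a contains n words, a given word occurs t times in it, that word appears in w documents, and there are x documents in total. Then tf = t / n, idf = log(x / w), and tf-idf = (t / n) * log(x / w).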

def computetfidf(tfbow, idfs):
    # multiply each word's term frequency by its inverse document frequency
    tfidf = {}
    for word, val in tfbow.items():
        tfidf[word] = val * idfs[word]
    return tfidf
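To make the arithmetic concrete, here is a worked example using the symbols t, n, w and x from the text. The helper name tfidf_score and the numbers are hypothetical, and the base-10 logarithm is chosen to match the functions above.

```python
import math

def tfidf_score(t, n, w, x):
    """TF-IDF of one word: (t / n) * log10(x / w)."""
    tf = t / n                 # share of the document taken up by the word
    idf = math.log10(x / w)    # rarity of the word across the corpus
    return tf * idf

# Assumed numbers: the word occurs 3 times in a 100-word document
# and appears in 10 of 1000 documents.
score = tfidf_score(t=3, n=100, w=10, x=1000)
print(score)   # 0.03 * 2, about 0.06
```

If the same word appeared in all 1000 documents, the IDF factor would be log10(1) = 0 and the score would drop to 0 however often it occurs in the document.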

from sklearn.feature_extraction.text import TfidfVectorizer
count_vec = TfidfVectorizer(binary=False, decode_error='ignore', stop_words='english')

Pass in the data to fit the vectorizer, then transform the text into word vectors:

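s1 = 'i love you so much'
s2 = 'i hate you! ****!'
s3 = 'i like you, but just like you'
response = count_vec.fit_transform([s1, s2, s3])  # each input must be a string
# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out()
print(count_vec.get_feature_names_out().tolist())
print(response.toarray())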

The output, with English stop words removed, is as follows:

['****', 'hate', 'just', 'like', 'love']
[[0.         0.         0.         0.         1.        ]
 [0.70710678 0.70710678 0.         0.         0.        ]
 [0.         0.         0.4472136  0.89442719 0.        ]]
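These values are not what the hand-written functions above would return. With its default settings, TfidfVectorizer multiplies the raw term count by a smoothed natural-log IDF, ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. A minimal sketch that reproduces the third row by hand (the helper name sklearn_style_row is hypothetical):

```python
import math

def sklearn_style_row(counts, dfs, n_docs):
    """Smoothed TF-IDF weights for one document, L2-normalized.

    counts: raw term counts in the document
    dfs:    number of documents containing each term
    """
    weights = [
        c * (math.log((1 + n_docs) / (1 + df)) + 1)
        for c, df in zip(counts, dfs)
    ]
    norm = math.sqrt(sum(wt * wt for wt in weights))
    return [wt / norm for wt in weights]

# Third sentence after stop-word removal: 'just' once, 'like' twice.
# Each of the two terms occurs in only 1 of the 3 documents.
row = sklearn_style_row(counts=[1, 2], dfs=[1, 1], n_docs=3)
print(row)   # approximately [0.4472136, 0.89442719]
```

Since both terms share the same IDF, only the 1:2 ratio of their counts survives normalization. The same reasoning gives 1.0 for the single remaining term in the first row and two equal weights of about 0.7071 in the second.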

Reference blog post:

Understanding TF (term frequency) and TF-IDF (term frequency-inverse document frequency)
