Feature Engineering in Machine Learning: A Detailed Guide

1.1 Exploratory data analysis and descriptive analysis

Commonly used functions (from the pandas package) include head(), info(), describe(), isnull(), and corr().
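As a minimal sketch of how these pandas calls fit together, the example below runs them on a small toy table. The column names and values are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative assumptions.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38],
    "income": [3000.0, 4200.0, 3900.0, np.nan, 5100.0],
    "city": ["A", "B", "A", None, "B"],
})

print(df.head(3))       # first three rows
df.info()               # dtypes and non-null counts per column
print(df.describe())    # summary statistics for the numeric columns

# Number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Pairwise Pearson correlation between the numeric columns
corr_matrix = df[["age", "income"]].corr()
print(corr_matrix)
```

Restricting corr() to the numeric columns avoids errors from the string column in recent pandas versions.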

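1.2 The four levels of data

2.1 Handling missing values

Note: missing values should be imputed after the data has been split into training and test sets. That is, a missing value in the test set is filled with the mean of that feature computed on the training set.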
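The note above can be sketched with scikit-learn's SimpleImputer: the mean is learned from the training rows only and then reused on the test rows. The toy array and the split parameters are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical single-feature data with two missing entries
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [np.nan], [7.0], [8.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Split first, so that no test-set information leaks into the imputation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # mean learned from training rows only
X_test_filled = imputer.transform(X_test)        # the training mean is reused here

train_mean = imputer.statistics_[0]
print("mean learned from the training set:", train_mean)
```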

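2.2 Standardization and normalization

For data transformations, sklearn provides a pipeline mechanism that chains several processing steps into a single estimator. This is convenient when trying out model parameters under cross-validation. See the official documentation for sklearn.pipeline.Pipeline for details. Taking missing-value handling for qualitative (categorical) data as an example: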

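from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

class CustomCategoryImputer(TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols

    def transform(self, df):
        X = df.copy()
        for col in self.cols:
            # Fill missing entries with the most frequent category of the column
            X[col] = X[col].fillna(X[col].value_counts().index[0])
        return X

    def fit(self, *_):
        # Nothing to learn; the most frequent value is looked up in transform
        return self

cci = CustomCategoryImputer(cols=['col1', 'col2'])
# ('***', ***) stands for any further pipeline steps
imputer = Pipeline([('category', cci), ('***', ***)])
imputer.fit_transform(X)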
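To illustrate why pipelines help when tuning parameters under cross-validation, here is a minimal sketch that chains a scaler and a classifier and searches over one hyperparameter. The iris dataset, the step names "scale" and "clf", and the parameter grid are assumptions chosen for illustration, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Step names ("scale", "clf") are arbitrary labels chosen for this sketch
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a step are addressed as <step name>__<parameter name>
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# In every fold the scaler is refit on that fold's training part only
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Because the scaler lives inside the pipeline, its statistics are never computed on the held-out fold, which keeps the cross-validation estimate honest.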

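3.1 Encoding categorical variables

3.2 Encoding text features

Text features are usually handled with the bag-of-words approach. The basic idea is to represent a piece of text by how often each word occurs in it, generally in the following three steps: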

In practice, you can call CountVectorizer() and TfidfVectorizer() from the sklearn package.
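As a minimal sketch of the bag-of-words idea, the example below turns three short sentences into a word-count matrix with CountVectorizer. The sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
docs = [
    "the cat sat",
    "the cat sat on the mat",
    "dogs chase cats",
]

vectorizer = CountVectorizer()           # lowercases and splits into word tokens
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each learned term to its column index
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(counts.toarray())
```

Each row is a document, each column a vocabulary term, and each cell the number of times that term occurs in the document.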

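4.1 Model performance metrics

Feature selection is judged by how well the resulting model performs. We first introduce the evaluation metrics for classification models:

For regression problems, commonly used evaluation metrics are:

Beyond these, some other factors are also worth considering, such as:

4.2 Feature selection methods

(1) Feature selection based on probability and statistics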

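import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator

class CustomCorrelationChooser(TransformerMixin, BaseEstimator):
    def __init__(self, response, cols_to_keep=None, threshold=None):
        self.response = response          # target column (a pandas Series)
        self.threshold = threshold        # minimum absolute correlation with the target
        self.cols_to_keep = cols_to_keep  # selected column names, set during fit

    def transform(self, X):
        # Keep only the columns selected during fit
        return X[self.cols_to_keep]

    def fit(self, X, *_):
        # Append the response as the last column and correlate every feature with it
        df = pd.concat([X, self.response], axis=1)
        self.cols_to_keep = df.columns[df.corr()[df.columns[-1]].abs() > self.threshold]
        # Drop the response itself from the selected columns
        self.cols_to_keep = [c for c in self.cols_to_keep if c in X.columns]
        return self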
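The same correlation-threshold idea can be sketched in plain pandas, without the transformer wrapper. The synthetic data, the column names "strong" and "weak", and the cutoff of 0.5 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)

# Hypothetical features: one tracks the target closely, one is pure noise
X = pd.DataFrame({
    "strong": signal + 0.1 * rng.normal(size=n),
    "weak": rng.normal(size=n),
})
y = pd.Series(signal, name="target")

threshold = 0.5  # assumed cutoff on the absolute Pearson correlation

# Absolute correlation of every feature with the target
corr_with_target = X.corrwith(y).abs()
cols_to_keep = corr_with_target[corr_with_target > threshold].index.tolist()

print(corr_with_target)
print(cols_to_keep)
```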

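import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# Keep the 5 features with the highest ANOVA F-scores
k_best = SelectKBest(f_classif, k=5)
k_best.fit_transform(X, y)
# Pair each column with its p-value (X is assumed to be a DataFrame), smallest first
p_values = pd.DataFrame({'column': X.columns, 'p_value': k_best.pvalues_}).sort_values('p_value')
# Features that are significant at the 5% level
p_values[p_values['p_value'] < .05]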

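(2) Feature selection based on machine learning models

In general, feature selection based on probability and statistics does not work very well when the number of features is very large.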

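from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Keep the features whose tree-based importance is at least 0.05
select_from_model = SelectFromModel(DecisionTreeClassifier(), threshold=.05)
selected_X = select_from_model.fit_transform(X, y)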

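from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# L1-regularized logistic regression; the liblinear solver supports the l1 penalty
logistic_selector = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear'), threshold=.05)
selected_X = logistic_selector.fit_transform(X, y)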

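4.3 Lessons on feature selection

PCA works from the covariance of the features and looks for principal-component directions that carry as much information (variance) as possible. LDA, by contrast, uses the known class labels of the samples and looks for a projection onto a new basis in which points of different classes lie farther apart and points of the same class are more tightly clustered.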
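The contrast between the two methods can be sketched on the iris dataset, which is an assumption chosen for illustration: PCA is fitted without labels, while LDA needs them.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, keeps the directions of largest variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# LDA: supervised, uses the labels to separate classes;
# it allows at most (number of classes - 1) components, here 2
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print("variance explained by the 2 principal components:",
      pca.explained_variance_ratio_)
print(X_pca.shape, X_lda.shape)
```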

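Feature transformation and feature learning both fall under feature extraction. The feature transformation methods PCA and LDA can only perform linear transformations, and the number of new features is limited by the number of input features. Feature learning, in contrast, uses deep learning methods and can extract more complex features.

6.1 Learning text features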
