Feature Selection: SelectKBest

Official documentation: sklearn.feature_selection.SelectKBest

Using a given scoring function, SelectKBest selects the k features most strongly related to the label.

class sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, *, k=10)

Parameters and attributes:

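Parameters

score_func: callable
    Function taking two arrays X and y and returning either a pair of arrays (scores, p-values) or a single array of scores. The default is f_classif (see "See Also" in the official documentation). The default function works only for classification tasks.

k: int or "all", optional, default=10
    Number of top features to select (the k best-scoring features are kept). The "all" option bypasses selection and is intended for use in a parameter search.

Attributes

scores_: array-like of shape (n_features,)
    Scores of the features.

pvalues_: array-like of shape (n_features,)
    p-values of the feature scores, or None if score_func returned only scores.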
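The two return shapes of score_func and the "all" option can be sketched as follows. This is a minimal illustration on synthetic data, not taken from the official documentation: the scorer variance_score and the random arrays are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption for illustration): 5 features with
# deliberately different spreads, and a random binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([1.0, 5.0, 0.1, 3.0, 0.5])
y = rng.integers(0, 2, size=100)


def variance_score(X, y):
    """Hypothetical scorer: one score per feature (its variance), no p-values."""
    return X.var(axis=0)


# A scorer that returns a single array leaves pvalues_ set to None.
by_variance = SelectKBest(score_func=variance_score, k=2).fit(X, y)
print(by_variance.scores_.shape)
print(by_variance.pvalues_)

# f_classif returns (scores, p-values), so both attributes are filled in.
by_f = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(by_f.pvalues_.shape)

# k="all" bypasses selection: every column is kept.
keep_all = SelectKBest(score_func=f_classif, k="all").fit(X, y)
print(keep_all.transform(X).shape)
```

Returning a bare array is convenient for quick custom rankings, at the cost of having no p-values to inspect afterwards.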

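Example from the official documentation:

Features are selected with the chi2 scoring function.

The 20 best features are kept out of the original 64.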

>>> from sklearn.datasets import load_digits
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> x, y = load_digits(return_X_y=True)
>>> x.shape
(1797, 64)
>>> x_new = SelectKBest(chi2, k=20).fit_transform(x, y)
>>> x_new.shape
(1797, 20)
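To make the selection step less opaque, the sketch below calls chi2 directly and ranks the scores by hand, then checks that SelectKBest picks the same columns. It uses the smaller iris dataset rather than digits (a substitution made here for readability, not part of the official example). Note that chi2 requires non-negative feature values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Iris: 150 samples, 4 non-negative features (chi2 rejects negative values).
X, y = load_iris(return_X_y=True)

# Score every feature against the label, then take the 2 highest scores.
scores, pvalues = chi2(X, y)
top2 = np.sort(np.argsort(scores)[-2:])

# SelectKBest performs the same ranking internally.
selector = SelectKBest(chi2, k=2).fit(X, y)
selected = selector.get_support(indices=True)

print(top2, selected)
```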

Methods

fit(X, y)
    Runs the score function on (X, y) and records the feature scores used for selection. This is the equivalent of training a model.
    Parameters
        X: array-like of shape (n_samples, n_features). The training input samples.
        y: array-like of shape (n_samples,). The target values (class labels in classification, real numbers in regression).
    Returns
        self: object

fit_transform(X[, y])
    Fits to X, then transforms it.
    Parameters
        X: array-like of shape (n_samples, n_features). Input samples.
        y: ndarray of shape (n_samples,), default=None. Target values.
        **fit_params: dict. Additional fit parameters.
    Returns
        X_new: ndarray of shape (n_samples, n_features_new). The transformed array.

get_params([deep])
    Gets the parameters of this estimator.
    Parameters
        deep: bool, default=True. If True, also returns the parameters of contained sub-objects that are estimators.
    Returns
        Parameter names mapped to their values.

get_support([indices])
    Gets a boolean mask, or the integer indices, of the selected features.

inverse_transform(X)
    Reverses the transformation.
    Parameters
        X: array of shape [n_samples, n_selected_features]. The input samples.
    Returns
        X_r: array of shape [n_samples, n_original_features]. X with columns of zeros inserted where features were removed by transform.

set_params(**params)
    Sets the parameters of this estimator.
    Parameters
        **params: dict. Estimator parameters.
    Returns
        self: object. The estimator instance.

transform(X)
    Reduces X to the k selected features.
    Parameters
        X: array of shape [n_samples, n_features]. The input samples.
    Returns
        X_new: array of shape [n_samples, n_selected_features]. The input samples with only the k best features kept.
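The methods listed above can be exercised together in a short sketch. The data here is synthetic and assumed purely for illustration: the label is built from column 2 so that one feature is clearly informative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption): the label depends only on column 2.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 4))
y = (X[:, 2] > 0).astype(int)

selector = SelectKBest(f_classif, k=1)
X_new = selector.fit_transform(X, y)        # fit, then keep the best column

mask = selector.get_support()               # boolean mask over all 4 features
idx = selector.get_support(indices=True)    # the same selection as integer indices
print(mask, idx)

# inverse_transform restores the original width, filling dropped columns with zeros.
X_back = selector.inverse_transform(X_new)
print(X_back.shape)

# get_params / set_params read and update the constructor arguments.
selector.set_params(k=2)
print(selector.get_params()["k"])
```

After set_params changes k, the selector must be fitted again before transform reflects the new value.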

The following example uses:

scipy.stats.pearsonr - Pearson correlation coefficient

sklearn.feature_selection.f_regression - univariate linear regression tests

Data:

from sklearn.datasets import make_regression
x, y = make_regression(n_samples=1000, n_features=3, n_informative=1, noise=100, random_state=9527)

Compute the correlation coefficient between each feature and the label:

from scipy.stats import pearsonr
p1 = pearsonr(x[:, 0], y)
p2 = pearsonr(x[:, 1], y)
p3 = pearsonr(x[:, 2], y)

The results show that the second feature of x is the important one (it has the highest correlation coefficient):

print(p1)
>>> (0.01293680050695129, 0.6828310401786694)
print(p2)
>>> (0.6680920624164118, 2.8345376164035335e-130)
print(p3)
>>> (0.03938982451397195, 0.21330062660673496)

The scoring function used here is f_regression, and only the single best feature (the one most strongly correlated with the target) is kept.

from sklearn import feature_selection as fs
best = fs.SelectKBest(score_func=fs.f_regression, k=1)
best.fit(x, y)
>>> SelectKBest(k=1, score_func=<function f_regression at 0x...>)

best.scores_
>>> array([1.67054044e-01, 8.04573104e+02, 1.55086141e+00])
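These scores are tied directly to the Pearson coefficients computed earlier. With its default centering, f_regression converts each correlation r into an F-statistic, F = r^2 / (1 - r^2) * (n - 2), where n is the number of samples. The sketch below regenerates the same data and checks that relationship; the names r and f_from_r are introduced here only for illustration.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

# Same data as in the walkthrough above.
x, y = make_regression(n_samples=1000, n_features=3, n_informative=1,
                       noise=100, random_state=9527)

n = x.shape[0]

# Pearson r of each feature with the target (index 0 is the coefficient).
r = np.array([pearsonr(x[:, j], y)[0] for j in range(x.shape[1])])

# Convert r to an F-statistic with n - 2 degrees of freedom.
f_from_r = r**2 / (1 - r**2) * (n - 2)

# f_regression returns (F-scores, p-values).
f_scores, p_values = f_regression(x, y)
print(np.allclose(f_from_r, f_scores))
```

Because F grows monotonically with r^2, ranking by f_regression score and ranking by absolute Pearson correlation give the same order.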

x_new contains a single feature (one column):

x_new = best.transform(x)
x_new.shape
>>> (1000, 1)

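The output below shows that the second feature was kept. This agrees with the Pearson correlation coefficients computed above and with the way the data was generated.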

x_new[:5]
>>> array([[-0.6872076],
           [2.31728703],
           [-1.51368674],
           [-0.24881066],
           [0.39894185]])

x[:5]
>>> array([[-1.25468454, -0.6872076, -0.60472765],
           [-0.15208513, 2.31728703, -1.44588579],
           [0.3246091, -1.51368674, 0.01249735],
           [-0.43843568, -0.24881066, 0.77710434],
           [-0.36040751, 0.39894185, -0.61578169]])
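Rather than comparing the printed rows by eye, the kept column can be identified programmatically with get_support. This is a small sketch that rebuilds the same selector; the variable idx is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Same data and selector as in the walkthrough above.
x, y = make_regression(n_samples=1000, n_features=3, n_informative=1,
                       noise=100, random_state=9527)
best = SelectKBest(score_func=f_regression, k=1).fit(x, y)
x_new = best.transform(x)

# Zero-based index of the kept column (the second feature has index 1).
idx = best.get_support(indices=True)
print(idx)

# The transformed data is exactly that column of the original array.
print(np.array_equal(x_new[:, 0], x[:, idx[0]]))
```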
