Data Mining Project (2): Feature Selection

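The data used for feature selection here is the same financial data as last time, and the task is to predict whether a loan user will become overdue. To rule out the influence of missing values, all samples with missing values were deleted, along with several variables that have no effect on the classification, leaving a dataset of size 1534×86.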

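1. Remove features with small variance, the so-called non-divergent features. These features change very little, or not at all, across the whole dataset, so they can be assumed to have little effect on the classification result.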

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.read_csv('data.csv')
label = data['status']  # target label
data.drop('status', axis=1, inplace=True)
names = data.columns
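The snippet above imports VarianceThreshold but stops before applying it. A minimal sketch of the filtering step might look like the following. The toy DataFrame, the helper name drop_low_variance, and the threshold of 0.01 are assumptions chosen for illustration, not values from the original project.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def drop_low_variance(df, threshold=0.01):
    """Return df without the columns whose variance is <= threshold."""
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(df)
    kept = df.columns[selector.get_support()]
    return df[kept]

# Toy data (hypothetical): 'b' is constant and 'c' barely changes.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [5.0, 5.0, 5.0, 5.0],
    'c': [0.0, 0.0, 0.0, 0.01],
})
reduced = drop_low_variance(toy, threshold=0.01)
print(list(reduced.columns))
```

Variance depends on the scale of each feature, so a single threshold is easier to interpret if the features are first brought to a comparable range.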

2. Feature selection by IV (information value)

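For a variable in the dataset, the more information it carries, the more it contributes to the classification result and the greater its information value, that is, the larger the variable's IV (information value). For the details of how IV is calculated, see: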
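As a quick reference, the quantity that the calciv function below computes can be written as follows. The notation is an assumed labeling, not taken from the original post: $n_{0,i}$ and $n_{1,i}$ are the numbers of samples with label 0 and label 1 whose feature value falls in group $i$, and $n_0$, $n_1$ are the class totals.

```latex
\mathrm{IV} = \sum_{i} \left( \frac{n_{0,i}}{n_0} - \frac{n_{1,i}}{n_1} \right)
              \ln \frac{n_{0,i}/n_0}{n_{1,i}/n_1}
```

The logarithm is the weight of evidence (WOE) of group $i$. Every term of the sum is non-negative, because the difference and the logarithm always have the same sign.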

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def calciv(x, y):
    """IV of one feature x against a 0/1 label y; each distinct value is a group."""
    n_0 = np.sum(y == 0)
    n_1 = np.sum(y == 1)
    values = np.unique(x)
    n_0_group = np.zeros(values.shape)
    n_1_group = np.zeros(values.shape)
    for i in range(len(values)):
        n_0_group[i] = y[(x == values[i]) & (y == 0)].count()
        n_1_group[i] = y[(x == values[i]) & (y == 1)].count()
    # A value seen in only one class yields an infinite term, which is capped below.
    iv = np.sum((n_0_group/n_0 - n_1_group/n_1) * np.log((n_0_group/n_0) / (n_1_group/n_1)))
    if iv >= 1.0:  # handle extreme values
        iv = 1
    return iv

def caliv_batch(data, y):
    ivlist = []
    for col in data.columns:
        iv = calciv(data[col], y)
        ivlist.append(iv)
    names = list(data.columns)
    iv_df = pd.DataFrame({'var': names, 'iv': ivlist}, columns=['var', 'iv'])
    return iv_df, ivlist

im_iv, ivl = caliv_batch(data, label)
threshold = 0.02
threshold2 = 0.6
data_index = []
for i in range(len(ivl)):
    # Drop features whose IV is too low or suspiciously high.
    if (im_iv['iv'][i] < threshold) | (im_iv['iv'][i] > threshold2):
        data_index.append(im_iv['var'][i])
data.drop(data_index, axis=1, inplace=True)

skf = StratifiedKFold(n_splits=10)  # 10-fold cross-validation
ac = []
for train_index, test_index in skf.split(data, label):
    x_train, x_test = data.iloc[train_index, :], data.iloc[test_index, :]
    y_train, y_test = label.iloc[train_index], label.iloc[test_index]
    forest = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    forest.fit(x_train, y_train)  # training
    pre_rf = forest.predict(x_test)
    ac_train = forest.score(x_train, y_train)  # accuracy on the training folds
    ac.append(forest.score(x_test, y_test))    # accuracy on the held-out fold
print('the accuracy rf is : {}'.format(np.mean(ac)))

In the end only 3 features remain in the dataset: ['top_trans_count_last_1_month', 'rank_trad_1_month', 'regional_mobility']

The classification accuracy is 0.7877, which is better than the result obtained with all features.
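One caveat of calciv is that it treats every distinct value as its own group, so a continuous feature with many distinct values tends to hit the cap of 1 and get dropped. A common workaround is to bin the feature first and smooth empty cells. The sketch below assumes quantile binning with pd.qcut, 5 bins, and an additive smoothing constant of 0.5. All three are illustrative choices, as are the function name binned_iv and the synthetic data.

```python
import numpy as np
import pandas as pd

def binned_iv(x, y, n_bins=5, eps=0.5):
    """IV of a numeric feature x against a 0/1 label y, using quantile bins.

    eps is added to every cell count so that empty cells do not produce
    log(0) or a division by zero (an assumed smoothing choice).
    """
    x = pd.Series(np.asarray(x, dtype=float))
    y = pd.Series(np.asarray(y))
    bins = pd.qcut(x, q=n_bins, duplicates='drop')
    # Rows: bins, columns: label values 0 and 1.
    counts = pd.crosstab(bins, y).reindex(columns=[0, 1], fill_value=0)
    dist_0 = (counts[0] + eps) / (counts[0] + eps).sum()
    dist_1 = (counts[1] + eps) / (counts[1] + eps).sum()
    return float(np.sum((dist_0 - dist_1) * np.log(dist_0 / dist_1)))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
informative = y + rng.normal(0.0, 1.0, size=1000)  # shifted by the label
noise = rng.normal(0.0, 1.0, size=1000)            # unrelated to the label

iv_informative = binned_iv(informative, y)
iv_noise = binned_iv(noise, y)
print(iv_informative, iv_noise)
```

With binning, the feature that depends on the label should score a clearly higher IV than pure noise, instead of both saturating at the cap.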

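3. Use a random forest for feature selection. Features are chosen mainly by their importance scores. Ranking once and keeping the top features is used here like a filter, although importance-based selection with a model is more commonly classed as an embedded method.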

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# data is assumed to hold the full feature set again (reloaded as in step 1).
forest = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=1)
forest.fit(data, label)
importance = forest.feature_importances_
imp_result = np.argsort(importance)[::-1]  # feature indices, most important first
columns = list(data.columns[imp_result])   # feature names in the same order
for i in range(data.shape[1]):
    print("%2d. %-*s %f" % (i + 1, 30, columns[i], importance[imp_result[i]]))
columns = columns[0:44]  # keep the 44 most important features
data1 = data[columns]

def rf_classify(train_data, train_label, test_data, test_label):
    forest = RandomForestClassifier(n_estimators=15, random_state=0, n_jobs=-1)
    forest.fit(train_data, train_label)  # training
    # pre_rf = forest.predict(x_test)
    ac_test = forest.score(test_data, test_label)  # accuracy on the held-out fold
    return ac_test

skf = StratifiedKFold(n_splits=10)  # 10-fold cross-validation
ac = []
for train_index, test_index in skf.split(data1, label):
    x_train, x_test = data1.iloc[train_index, :], data1.iloc[test_index, :]
    y_train, y_test = label.iloc[train_index], label.iloc[test_index]
    acc = rf_classify(x_train, y_train, x_test, y_test)
    # acc = lightgbm_classify(x_train, y_train, x_test, y_test)
    ac.append(acc)
print('the accuracy rf is : {}'.format(np.mean(ac)))
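Note that ranking the features on the full dataset and cross-validating afterwards lets the test folds influence which features are kept, which can make the accuracy look better than it really is. One way to avoid this is to put the selection step inside the cross-validation loop. The sketch below assumes scikit-learn's Pipeline and SelectFromModel. The synthetic data from make_classification stands in for the loan data, and max_features=10 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the loan data (hypothetical).
X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)

pipe = Pipeline([
    # Keep the 10 features ranked highest by random-forest importance.
    # threshold=-np.inf makes max_features the only selection criterion.
    ('select', SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold=-np.inf, max_features=10)),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=0)),
])

# The selector is refit on the training part of every fold,
# so the held-out fold never influences which features are kept.
cv = StratifiedKFold(n_splits=5)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
print('mean accuracy: {:.4f}'.format(scores.mean()))
```

The price is that each fold may keep a slightly different feature set, so a final refit on all the data is still needed to report which features were selected.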

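4. Besides filter-style methods, we can also use a genetic algorithm: encode the features, then apply selection, crossover, and mutation to arrive at the final feature subset.

To be added later.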
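Until that part is written, here is one possible minimal sketch of the idea, not the author's implementation. It assumes a bit-mask encoding (one bit per feature), binary tournament selection, single-point crossover, bit-flip mutation, and a fitness equal to the 3-fold cross-validated accuracy of a shallow decision tree. The function names fitness and ga_select, all hyperparameters, and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def fitness(mask, X, y):
    """Mean 3-fold CV accuracy using only the features where mask is True."""
    if not mask.any():  # an empty feature set cannot be scored
        return 0.0
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

def ga_select(X, y, pop_size=12, n_gen=8, p_mut=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    # Encoding: each individual is a boolean mask, one bit per feature.
    pop = rng.random((pop_size, n_feat)) < 0.5
    best_mask, best_fit = None, -1.0
    for gen in range(n_gen):
        fits = np.array([fitness(ind, X, y) for ind in pop])
        if fits.max() > best_fit:  # remember the best individual seen so far
            best_fit = fits.max()
            best_mask = pop[fits.argmax()].copy()
        children = []
        while len(children) < pop_size:
            # Selection: binary tournament, run twice to get two parents.
            parents = []
            for _ in range(2):
                a, b = rng.integers(0, pop_size, size=2)
                parents.append(pop[a] if fits[a] >= fits[b] else pop[b])
            # Crossover: single cut point, so both parents contribute.
            cut = rng.integers(1, n_feat)
            child = np.concatenate([parents[0][:cut], parents[1][cut:]])
            # Mutation: flip each bit with probability p_mut.
            flip = rng.random(n_feat) < p_mut
            children.append(np.where(flip, ~child, child))
        pop = np.array(children)
    return best_mask, best_fit

# Synthetic stand-in for the loan data (hypothetical).
X, y = make_classification(n_samples=200, n_features=15, n_informative=4,
                           n_redundant=0, random_state=0)
mask, score = ga_select(X, y)
print('selected feature indices:', np.flatnonzero(mask))
print('cv accuracy: {:.4f}'.format(score))
```

Because the fitness is itself a cross-validated score, the cost grows with population size times number of generations, which is why a cheap classifier is used here.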

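5. Recursive feature elimination (RFE), a wrapper method, here with a random forest as the estimator

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

data_ = RFE(estimator=RandomForestClassifier(), n_features_to_select=10).fit_transform(data, label)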

The accuracy of random forest classification is 0.7764.
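Calling fit_transform returns a bare array, so the names of the surviving features are lost. A small sketch of how they could be recovered from the fitted selector is shown below. The synthetic data, the column names f0 to f19, and step=2 are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data with hypothetical column names f0..f19.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)
df = pd.DataFrame(X, columns=['f{}'.format(i) for i in range(20)])

rfe = RFE(estimator=RandomForestClassifier(n_estimators=30, random_state=0),
          n_features_to_select=10, step=2)  # step=2: drop 2 features per round
rfe.fit(df, y)

selected = list(df.columns[rfe.support_])  # names of the features kept
ranking = pd.Series(rfe.ranking_, index=df.columns).sort_values()
df_selected = df[selected]                 # same columns, still a DataFrame
print(selected)
```

In ranking, the kept features all have rank 1, and larger ranks mark features that were eliminated earlier.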

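6. All of the selection methods above work on a complete dataset. Sometimes, however, we worry that imputing missing values will distort the data distribution. In that case a decision tree can be used for feature selection: the Gini index or information entropy is computed in proportion, that is, weighted by the share of samples for which the feature is not missing, and features are then ranked and selected by importance. To be added later.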
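The proportional calculation can be illustrated with a small NumPy sketch in the style of C4.5: the impurity decrease of a split is computed only on the samples where the feature is present, then scaled by the fraction of such samples. The function names, the fixed split threshold, and the toy arrays are assumptions for illustration. A real tree would also search over thresholds.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini_gain(x, y, threshold):
    """Gini decrease of the split x <= threshold, ignoring NaNs in x and
    scaling the result by the fraction of samples where x is present."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    known = ~np.isnan(x)
    if not known.any():
        return 0.0
    xk, yk = x[known], y[known]
    left, right = yk[xk <= threshold], yk[xk > threshold]
    children = (len(left) * gini(left) + len(right) * gini(right)) / len(yk)
    return known.mean() * (gini(yk) - children)

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x_full = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
x_half = np.array([1.0, 2.0, np.nan, np.nan, 6.0, 7.0, np.nan, np.nan])

gain_full = weighted_gini_gain(x_full, y, threshold=5.0)  # 0.5
gain_half = weighted_gini_gain(x_half, y, threshold=5.0)  # 0.25
print(gain_full, gain_half)
```

Both features separate the classes perfectly where they are observed, but the one with half of its values missing gets half the gain, so it ranks lower without any imputation.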

Because I am not very familiar with this data and am not sure how to derive new features from it, that part will be added later.
