Data Mining Project (Part 2)

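[Feature Engineering (2 days)]

Goal: derive new features from the data and select among them.

This includes, but is not limited to, feature derivation and feature selection.

Perform feature selection with IV values and with a random forest, among other methods, along with any other feature-engineering steps you can think of.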

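Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing n features out of the m available features so that a specific metric of the system is optimized. It picks the most effective of the original features in order to reduce the dimensionality of the dataset, and it is an important way to improve the performance of a learning algorithm.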

Feature selection with IV values

import math
import numpy as np
from scipy import stats
from sklearn.utils.multiclass import type_of_target

def woe(X, y, event=1):
    # Return a dict mapping each column of X to its IV
    iv_dict = {}
    for feature in X.columns:
        x = X[feature].values
        # 1) Discretize continuous features
        if type_of_target(x) == 'continuous':
            x = discrete(x)
        # 2) Compute the WoE and IV of this feature
        woe_dict, iv = woe_single_x(x, y, feature, event)
        iv_dict[feature] = iv
    return iv_dict

def discrete(x):
    # Discretize the feature into 5 equal-frequency bins
    res = np.zeros(x.shape)
    for i in range(5):
        point1 = stats.scoreatpercentile(x, i * 20)
        point2 = stats.scoreatpercentile(x, (i + 1) * 20)
        # Both bounds are inclusive, so a value sitting exactly on a
        # cut point ends up in the later bin
        mask = (x >= point1) & (x <= point2)
        res[mask] = i + 1    # label values in the i-th bin as i + 1
    return res

def woe_single_x(x, y, feature, event=1):
    # event is the label of the positive class
    # Work on a plain array so rows are selected by position, not by index label
    y = np.asarray(y)
    event_total = np.sum(y == event)
    non_event_total = y.shape[-1] - event_total
    iv = 0
    woe_dict = {}
    for x1 in set(x):    # iterate over the bins
        y1 = y[x == x1]
        event_count = np.sum(y1 == event)
        non_event_count = y1.shape[-1] - event_count
        rate_event = event_count / event_total
        rate_non_event = non_event_count / non_event_total
        # Replace a zero rate with a small constant to avoid log(0)
        # and division by zero
        if rate_event == 0:
            rate_event = 0.0001
        elif rate_non_event == 0:
            rate_non_event = 0.0001
        woei = math.log(rate_event / rate_non_event)
        woe_dict[x1] = woei
        iv += (rate_event - rate_non_event) * woei
    return woe_dict, iv
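The five-bin discretization above compares against percentile cut points by hand. As an alternative sketch, not the method used in this post, equal-frequency binning can also be done with `pandas.qcut`. The helper name `discrete_qcut` and its `n_bins` parameter are hypothetical and chosen only for this illustration.

```python
import numpy as np
import pandas as pd

def discrete_qcut(x, n_bins=5):
    """Equal-frequency binning; returns bin labels starting at 1 (as floats).

    Hypothetical helper for illustration. With heavily tied data,
    duplicates="drop" merges identical edges, so fewer than n_bins
    bins may be produced.
    """
    codes = pd.qcut(x, q=n_bins, labels=False, duplicates="drop")
    return np.asarray(codes, dtype=float) + 1

# Usage on a simple evenly spread array
values = np.arange(100, dtype=float)
bins = discrete_qcut(values)
labels, counts = np.unique(bins, return_counts=True)
print(labels, counts)
```

Here each value falls in exactly one half-open interval, so boundary values are not reassigned by a later pass.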

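import warnings
warnings.filterwarnings("ignore")
# x_train, y_train: training features and labels, assumed to be prepared beforehand
iv_dict = woe(x_train, y_train)
# Sort features by IV, highest first
iv = sorted(iv_dict.items(), key=lambda x: x[1], reverse=True)
iv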
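Once the features are ranked by IV, a cut-off is needed to actually select a subset. A commonly cited rule of thumb in credit scoring treats features with IV below roughly 0.02 as not predictive. The sketch below applies that idea; the function name `select_by_iv`, the default threshold, and the example values are assumptions made for illustration, not results from this project.

```python
def select_by_iv(iv_dict, threshold=0.02):
    """Return feature names whose IV is at least `threshold`, highest IV first."""
    ranked = sorted(iv_dict.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, iv in ranked if iv >= threshold]

# Hypothetical IV values, for illustration only
example_iv = {"feature_a": 0.35, "feature_b": 0.08, "feature_c": 0.004}
selected = select_by_iv(example_iv)
print(selected)  # ['feature_a', 'feature_b']
```

The threshold is worth tuning against validation performance rather than being taken as fixed.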

Feature selection with a random forest

import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Check the performance with default parameters
rf0 = RandomForestClassifier(oob_score=True, random_state=2333)
rf0.fit(x_train, y_train)
print('OOB score:', rf0.oob_score_)
# model_metrics is an evaluation helper that is not defined in this post
model_metrics(rf0, x_train, x_test, y_train, y_test)

# Refit with tuned parameters
rf = RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50,
                            min_samples_leaf=20, max_features=9, oob_score=True, random_state=2333)
rf.fit(x_train, y_train)
print('OOB score:', rf.oob_score_)
model_metrics(rf, x_train, x_test, y_train, y_test)
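The code above fits the forest and reports its scores; to use it for feature selection, the fitted model's `feature_importances_` attribute can be ranked. The following is a minimal sketch on synthetic data from `make_classification`, since the project's own `x_train` and `y_train` are not reproduced here. The feature names, the model settings, and the choice of keeping the top 5 features are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 10 features, 3 of them informative
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X, y)

# feature_importances_ holds one impurity-based importance per column
importances = rf_demo.feature_importances_
order = np.argsort(importances)[::-1]  # indices sorted by importance, descending
ranking = [(feature_names[i], float(importances[i])) for i in order]

# Keep the k most important features (k = 5 is an arbitrary choice here)
top_k = [name for name, _ in ranking[:5]]
print(ranking)
print(top_k)
```

Impurity-based importances tend to favor features with many distinct values, so it can be worth cross-checking the ranking against the IV ranking before dropping anything.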

There is still a great deal to learn and understand, and I am humbly drawing on the work of those who came before me.

