Example: computing WOE and IV values with decision-tree binning

This example uses credit card delinquency data: the training set of the Kaggle competition Give Me Some Credit.

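The target variable is SeriousDlqin2yrs, which indicates whether the customer becomes 90 or more days past due in the future: 1 means 90+ days past due (a bad customer in the usual sense), and 0 means a good customer with no 90+ day delinquency.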
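
With good and bad customers defined this way, the WOE and IV of a binned feature can be written out explicitly. The notation here is an assumption introduced for illustration: $G_i$ and $B_i$ are the numbers of good and bad customers in bin $i$, and $G$ and $B$ are the totals over all bins. The sign convention (good over bad inside the logarithm) follows the code in this post; some references use the inverse ratio, which flips the sign of WOE but leaves IV unchanged.

```latex
\mathrm{WOE}_i = \ln\!\left(\frac{G_i / G}{B_i / B}\right),
\qquad
\mathrm{IV}_i = \left(\frac{G_i}{G} - \frac{B_i}{B}\right)\mathrm{WOE}_i,
\qquad
\mathrm{IV} = \sum_i \mathrm{IV}_i
```

A bin with no good or no bad customers makes $\mathrm{WOE}_i$ undefined (a logarithm of zero or a division by zero), which is one reason to require a minimum share of samples in each bin.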

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def optimal_binning_boundary(x: pd.Series, y: pd.Series, nan: float = -999.) -> list:
    '''Use a decision tree to obtain the list of optimal bin boundaries.'''
    boundary = []  # list of bin boundaries to return

    x = x.fillna(nan).values  # fill missing values
    y = y.values

    clf = DecisionTreeClassifier(criterion='entropy',  # split by minimizing information entropy
                                 max_leaf_nodes=6,  # maximum number of leaf nodes
                                 min_samples_leaf=0.05)  # minimum fraction of samples per leaf
    clf.fit(x.reshape(-1, 1), y)  # train the decision tree

    n_nodes = clf.tree_.node_count
    children_left = clf.tree_.children_left
    children_right = clf.tree_.children_right
    threshold = clf.tree_.threshold

    for i in range(n_nodes):
        # Leaves have identical left and right child ids, so only
        # internal (split) nodes contribute a boundary value.
        if children_left[i] != children_right[i]:
            boundary.append(threshold[i])

    boundary.sort()

    min_x = x.min()
    max_x = x.max() + 0.1  # +0.1 so that the half-open bins still include the feature's maximum
    boundary = [min_x] + boundary + [max_x]
    return boundary


def feature_woe_iv(x: pd.Series, y: pd.Series, nan: float = -999.) -> pd.DataFrame:
    '''Compute WOE and IV for each bin of a variable and return a DataFrame.'''
    x = x.fillna(nan)
    boundary = optimal_binning_boundary(x, y, nan)  # optimal bin boundaries
    df = pd.concat([x, y], axis=1)  # combine x and y into one DataFrame for the later steps
    df.columns = ['x', 'y']  # rename the feature and target columns
    # Assign each x value to its bin; right=False makes the right edge open
    df['bins'] = pd.cut(x=x, bins=boundary, right=False)

    grouped = df.groupby('bins')['y']  # count good, bad and total customers per bin
    result_df = grouped.agg([('good', lambda y: (y == 0).sum()),
                             ('bad', lambda y: (y == 1).sum()),
                             ('total', 'count')])

    result_df['good_pct'] = result_df['good'] / result_df['good'].sum()  # share of good customers
    result_df['bad_pct'] = result_df['bad'] / result_df['bad'].sum()  # share of bad customers
    result_df['total_pct'] = result_df['total'] / result_df['total'].sum()  # share of all customers
    result_df['bad_rate'] = result_df['bad'] / result_df['total']  # bad rate within the bin
    result_df['woe'] = np.log(result_df['good_pct'] / result_df['bad_pct'])  # WOE
    result_df['iv'] = (result_df['good_pct'] - result_df['bad_pct']) * result_df['woe']  # IV per bin

    print(f"IV of this variable = {result_df['iv'].sum()}")
    return result_df


if __name__ == "__main__":
    data = pd.read_csv('./data/cs-training.csv')
    # boundary = optimal_binning_boundary(x=data['RevolvingUtilizationOfUnsecuredLines'],
    #                                     y=data['SeriousDlqin2yrs'])
    # print(boundary)
    result_df = feature_woe_iv(x=data['RevolvingUtilizationOfUnsecuredLines'],
                               y=data['SeriousDlqin2yrs'])
    print(result_df)
    result_df.to_excel("./gen_data/result.xlsx")

After the result is exported to an Excel file, Excel's highlighting features can be used to make the data easier to read visually.
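
The boundary extraction in optimal_binning_boundary relies on how scikit-learn stores a fitted tree: a leaf has the same value in children_left and children_right, so any node where the two differ is a split node whose threshold can serve as a bin edge. The following is a minimal sketch of that idea on synthetic data, not the Kaggle data set; the toy feature (integers 0 to 99) and the label rule (1 exactly when the feature is below 30) are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deterministic data (assumed for illustration only):
# the label is 1 exactly when the feature value is below 30.
x = np.arange(100, dtype=float)
y = (x < 30).astype(int)

# Same tree settings as in the post.
clf = DecisionTreeClassifier(criterion="entropy",
                             max_leaf_nodes=6,
                             min_samples_leaf=0.05)
clf.fit(x.reshape(-1, 1), y)

tree = clf.tree_
# Split nodes are those whose left and right child ids differ;
# leaves carry the same sentinel value in both arrays.
is_split = tree.children_left != tree.children_right
split_thresholds = sorted(tree.threshold[is_split].tolist())
print("split thresholds:", split_thresholds)

# Build half-open bins [a, b). The +0.1 on the upper edge keeps the
# sample with the maximum feature value inside the last bin.
boundary = [x.min()] + split_thresholds + [x.max() + 0.1]
bins = pd.cut(x, bins=boundary, right=False)
counts = pd.Series(bins).value_counts().sort_index()
print(counts)
```

Because the toy labels are perfectly separable, one split is enough and the tree stops early. On real data the tree typically uses more of the six allowed leaves, giving up to five interior boundaries.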
