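The idea is simple. Start with 20 or more initial bins and first make sure every bin contains both labels, 0 and 1: a bin missing either label is merged with its neighbor (the first bin into the next one, any other bin into the one before it). Then compute the chi-square value for each pair of adjacent bins and merge the pair with the smallest value, repeating until the target number of bins remains. The code is as follows: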
import pandas as pd
import numpy as np
import scipy
from scipy import stats


def chi_bin(df, var, target, binnum=5, maxcut=20):
    '''
    df: data
    var: variable
    target: target / label
    binnum: the number of bins output
    maxcut: initial bins number
    '''
    # work on a copy so that adding the "cut" column does not touch df
    data = df[[var, target]].copy()
    # equal-frequency cut of var into (at most) maxcut bins
    data["cut"], breaks = pd.qcut(
        data[var], q=maxcut, duplicates="drop", retbins=True)
    # count 1 and 0 in each bin; observed=False keeps bins in which a
    # label never occurs, reporting them with a count of 0
    count_1 = data.loc[data[target] == 1].groupby(
        "cut", observed=False)[target].count()
    count_0 = data.loc[data[target] == 0].groupby(
        "cut", observed=False)[target].count()
    # bins value: min, max, count 0, count 1
    # breaks[:-1] pairs every lower edge with its upper edge, so the last
    # bin is kept even when duplicates="drop" yields fewer than maxcut bins
    bins_value = [*zip(breaks[:-1], breaks[1:], count_0, count_1)]

    # define woe
    def woe_value(bins_value):
        df_woe = pd.DataFrame(bins_value)
        df_woe.columns = ["min", "max", "count_0", "count_1"]
        df_woe["total"] = df_woe.count_1 + df_woe.count_0
        df_woe["bad_rate"] = df_woe.count_1 / df_woe.total
        df_woe["woe"] = np.log(
            (df_woe.count_0 / df_woe.count_0.sum())
            / (df_woe.count_1 / df_woe.count_1.sum()))
        return df_woe

    # define iv
    def iv_value(df_woe):
        rate = ((df_woe.count_0 / df_woe.count_0.sum())
                - (df_woe.count_1 / df_woe.count_1.sum()))
        iv = np.sum(rate * df_woe.woe)
        return iv

    # make sure every bin contains both 1 and 0
    for _ in range(len(bins_value)):
        # the first bin lacks a label: merge it into the second bin
        if 0 in bins_value[0][2:]:
            bins_value[0:2] = [(
                bins_value[0][0],
                bins_value[1][1],
                bins_value[0][2] + bins_value[1][2],
                bins_value[0][3] + bins_value[1][3])]
            continue
        # any other bin lacks a label: merge it into the previous bin
        for i in range(len(bins_value)):
            if 0 in bins_value[i][2:]:
                bins_value[i-1:i+1] = [(
                    bins_value[i-1][0],
                    bins_value[i][1],
                    bins_value[i-1][2] + bins_value[i][2],
                    bins_value[i-1][3] + bins_value[i][3])]
                break
        else:
            # no bin is missing a label any more
            break

    # calculate chi-square for adjacent bins and merge the minimum pair
    while len(bins_value) > binnum:
        chi_squares = []
        for i in range(len(bins_value) - 1):
            a = bins_value[i][2:]
            b = bins_value[i+1][2:]
            chi_square = scipy.stats.chi2_contingency([a, b])[0]
            chi_squares.append(chi_square)
        # merge the pair with the minimum chi-square (bin i with bin i+1)
        i = chi_squares.index(min(chi_squares))
        bins_value[i:i+2] = [(
            bins_value[i][0],
            bins_value[i+1][1],
            bins_value[i][2] + bins_value[i+1][2],
            bins_value[i][3] + bins_value[i+1][3])]
        df_woe = woe_value(bins_value)
        # print bin number and iv
        print("bins: {}, iv: {:.6f}".format(len(bins_value), iv_value(df_woe)))

    # return bins and woe information
    return woe_value(bins_value)
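To make the merge criterion concrete, here is a minimal sketch of the uncorrected Pearson chi-square statistic for two adjacent bins, computed by hand from their (count_0, count_1) pairs. The helper name `pearson_chi2` and the toy counts are illustrative assumptions, not part of the code above. Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so its value is generally somewhat smaller than this uncorrected statistic unless `correction=False` is passed.

```python
def pearson_chi2(a, b):
    """Uncorrected Pearson chi-square for two bins given as (count_0, count_1)."""
    total = sum(a) + sum(b)
    col_sums = [a[0] + b[0], a[1] + b[1]]
    chi2 = 0.0
    for row in (a, b):
        row_sum = sum(row)
        for observed, col_sum in zip(row, col_sums):
            # expected count if both bins shared the same label distribution
            expected = row_sum * col_sum / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# hypothetical counts: two bins with the same bad rate (10%)
similar = pearson_chi2((90, 10), (180, 20))
# hypothetical counts: two bins with very different bad rates
different = pearson_chi2((10, 20), (20, 10))
print(similar, different)
```

Adjacent bins with a small statistic have nearly the same label distribution, which is why the loop merges the pair with the minimum value first.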
Here is the result, starting from 10 initial bins with a target of 3 bins:
chi_bin(data, "age", "seriousdlqin2yrs", binnum=3, maxcut=10)
bins: 8, iv: 0.184862
bins: 7, iv: 0.184128
bins: 6, iv: 0.179518
bins: 5, iv: 0.176980
bins: 4, iv: 0.172406
bins: 3, iv: 0.160015
    min   max  count_0  count_1  total  bad_rate       woe
0   0.0  52.0    70293     7077  77370  0.091470 -0.266233
1  52.0  61.0    29318     1774  31092  0.057056  0.242909
2  61.0  72.0    26332      865  27197  0.031805  0.853755
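As a sanity check, the woe column and the final iv can be recomputed from the counts in the table above, using the same definitions as `woe_value` and `iv_value`: the WOE of a bin is the log of (its share of all 0s) divided by (its share of all 1s), and IV is the sum over bins of (share of 0s minus share of 1s) times WOE. The snippet below is a standalone sketch that uses only the standard library; its variable names are illustrative.

```python
import math

# (count_0, count_1) for the three final bins, copied from the table above
bins = [(70293, 7077), (29318, 1774), (26332, 865)]

total_0 = sum(c0 for c0, _ in bins)
total_1 = sum(c1 for _, c1 in bins)

# WOE per bin: log of (share of all 0s) / (share of all 1s)
woe = [math.log((c0 / total_0) / (c1 / total_1)) for c0, c1 in bins]

# IV: sum of (share of 0s - share of 1s) * WOE over all bins
iv = sum((c0 / total_0 - c1 / total_1) * w for (c0, c1), w in zip(bins, woe))

for (c0, c1), w in zip(bins, woe):
    print(f"count_0={c0:6d} count_1={c1:5d} woe={w:.6f}")
print(f"iv={iv:.6f}")
```

The printed values should agree with the woe column and the last iv line to within rounding.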