Data Mining: CTR Features

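Split the training set into k folds. For each fold, compute the CTR on the other k-1 folds and merge it into that held-out fold, repeating this k times so that no row's CTR feature is computed from its own label. Then compute the CTR on the whole training set and merge it into the test set.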

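import pandas as pd
from sklearn.model_selection import StratifiedKFold

# train, test (DataFrames), label and feature (list of column names)
# are assumed to be defined before this point.

def ctr_fea(train, test, feature):
    # Single-feature CTR: statistics come from `train`, merged into `test`
    for fea in feature:
        print(fea)
        temp = train[['label', fea]].groupby(fea)['label'].agg(['sum', 'count']).reset_index()
        # +10 in the denominator smooths categories with few samples
        temp[fea + '_ctr'] = temp['sum'] / (temp['count'] + 10)
        test = test.merge(temp[[fea, fea + '_ctr']], on=fea, how='left')
    # Pairwise CTR for every combination of two features
    for i in range(len(feature) - 1):
        for j in range(i + 1, len(feature)):
            col = [feature[i], feature[j]]
            print(col)
            name = '_'.join(col) + '_ctr'
            temp = train[['label'] + col].groupby(col)['label'].agg(['sum', 'count']).reset_index()
            temp[name] = temp['sum'] / (temp['count'] + 10)
            test = test.merge(temp[col + [name]], on=col, how='left')
    return test

train['label'] = label
train_new = None  # pd.concat silently drops None on the first iteration
skf = StratifiedKFold(n_splits=5, random_state=2019, shuffle=True)
for i, (train_index, valid_index) in enumerate(skf.split(train, label)):
    print('fold_{}'.format(i + 1))
    # CTR from the other folds, merged into the held-out fold
    temp = ctr_fea(train.iloc[train_index], train.iloc[valid_index], feature)
    train_new = pd.concat([train_new, temp])
# CTR from the full training set, merged into the test set
test_new = ctr_fea(train, test, feature)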
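To see what the +10 in the denominator does, here is a minimal sketch on toy data. The column name `site` and all values are made up for illustration; only the smoothing formula and the left merge mirror ctr_fea above.

```python
import pandas as pd

# Toy data: 'site' is a hypothetical categorical feature, 'label' is the click flag.
toy_train = pd.DataFrame({
    'site':  ['a'] * 40 + ['b'],
    'label': [1] * 10 + [0] * 30 + [1],
})
toy_test = pd.DataFrame({'site': ['a', 'b', 'c']})

stats = toy_train.groupby('site')['label'].agg(['sum', 'count']).reset_index()
stats['raw_ctr'] = stats['sum'] / stats['count']           # no smoothing
stats['site_ctr'] = stats['sum'] / (stats['count'] + 10)   # same smoothing as ctr_fea

# Left merge keeps every test row; categories unseen in training get NaN.
toy_out = toy_test.merge(stats[['site', 'raw_ctr', 'site_ctr']], on='site', how='left')
print(toy_out)
```

Category `b` has a single clicked sample, so its raw CTR is 1.0, while the smoothed value is 1/11. Category `c` never appears in the training data, so its CTR is NaN and has to be handled downstream (filled, or left to a model that accepts missing values).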

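1. After building the CTR features, the rows of the training set and test set are no longer in their original order. Sort them by id before training; otherwise the features will not line up with label and the model will be trained on mismatched data.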
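One possible way to restore the order is sketched below. It assumes each row has a unique `id` column, and the small frames are made up to simulate the fold-by-fold concatenation.

```python
import pandas as pd

# Hypothetical frame: 'id' is assumed to be a unique row identifier.
train = pd.DataFrame({'id': [1, 2, 3, 4], 'label': [0, 1, 0, 1]})

# Simulate the fold-by-fold concat, which returns rows out of order.
train_new = pd.concat([train.iloc[[2, 0]], train.iloc[[3, 1]]])

# Restore the original order by id and rebuild a clean 0..n-1 index.
train_new = train_new.sort_values('id').reset_index(drop=True)

# Taking the label from the same frame guarantees it matches the features.
label = train_new['label'].values
```

Since the label column travels with the rows through the merge and concat, reading it back from the sorted frame may be safer than reusing a label variable created earlier.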

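2. Note the difference between label.values and label = train['label']: they are two different structures. label.values is a NumPy array, whereas train['label'] is a pandas Series that carries an index.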
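The difference matters most when the index is not the default 0..n-1. A small sketch with made-up data (the names `other`, `from_series` and `from_array` are only for illustration):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({'label': [0, 1, 1]}, index=[10, 11, 12])
label = train['label']      # pandas Series, carries the index 10, 11, 12
label_arr = label.values    # NumPy array, purely positional

other = pd.DataFrame({'x': [5, 6, 7]})   # default index 0, 1, 2
other['from_series'] = label             # aligned by index: nothing matches, all NaN
other['from_array'] = label_arr          # assigned by position
print(other)
```

Assigning the Series aligns on the index and silently produces NaN where the indexes do not match, while the array is assigned row by row.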
