Looking back over the whole project, what left the deepest impression was how to pick the genuinely useful feature variables out of the many candidates.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Read the data; skiprows=1 skips the first line so reading starts from the second line
data = pd.read_csv(r"g:\data\loanstats_2016q2\loanstats_2016q2.csv", skiprows=1, low_memory=True)
# Drop columns where most of the data is missing
data.drop("id", axis=1, inplace=True)
data.drop("member_id", axis=1, inplace=True)
# Keep only the numeric part of each value
data.term.replace(to_replace="[^0-9]+", value="", inplace=True, regex=True)
data.int_rate.replace("%", value="", inplace=True, regex=True)
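As a quick sanity check of the regex replacement used above, here is a toy example on made-up values (the sample values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy Series mimicking the raw "term" and "int_rate" columns
term = pd.Series([" 36 months", " 60 months"])
rate = pd.Series(["13.56%", "7.97%"])

# Strip every run of non-digit characters, then cast to numeric
term_num = pd.to_numeric(term.replace(to_replace="[^0-9]+", value="", regex=True))
# For int_rate only the "%" sign needs to go, so the dot survives
rate_num = pd.to_numeric(rate.replace("%", "", regex=True))

print(term_num.tolist())  # [36, 60]
print(rate_num.tolist())  # [13.56, 7.97]
```

Note that `[^0-9]+` would also strip the decimal point, which is why the interest-rate column removes only the percent sign.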
# Drop text columns with too many distinct values; they would blow up into a huge
# number of columns during dummy encoding, so it is best to remove them first
data.drop("sub_grade", axis=1, inplace=True)
data.drop("emp_title", axis=1, inplace=True)
# Clean employment length: replace "n/a" with np.nan, then keep the numeric part
data.emp_length.replace("n/a", np.nan, inplace=True)
data.emp_length.replace(to_replace="[^0-9]+", value="", inplace=True, regex=True)
# Drop columns that are entirely empty. how="all": match only when every cell is
# missing; how="any": match when any single cell is missing.
# axis=1 works column by column, axis=0 row by row.
data.dropna(axis=1, how="all", inplace=True)
# Drop rows that are entirely empty
data.dropna(axis=0, how="all", inplace=True)
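A minimal illustration of how `how="all"` behaves on both axes (a toy frame, not the loan data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, np.nan, 5.0],
    "c": [np.nan, np.nan, np.nan],  # entirely empty column
})

# how="all" drops a column/row only when every cell in it is NaN
cols_kept = df.dropna(axis=1, how="all").columns.tolist()  # ['a', 'b']
rows_kept = df.dropna(axis=0, how="all").index.tolist()    # [0, 2]
print(cols_kept, rows_kept)
```

Column `c` and row 1 are removed because they are all-NaN; `a` keeps its partial values, which is exactly why `how="all"` rather than `how="any"` is used here.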
# Print the non-null count for every column
print(data.info(verbose=True, null_counts=True))
# Batch-drop the columns below; most of their values are empty
data.drop(["hardship_type","hardship_reason","hardship_status","deferral_term","hardship_amount","hardship_start_date",\
"hardship_end_date","payment_plan_start_date","hardship_length","hardship_dpd","hardship_loan_status",\
"orig_projected_additional_accrued_interest","hardship_payoff_balance_amount","hardship_last_payment_amount",\
"debt_settlement_flag_date","settlement_status","settlement_date","settlement_amount","settlement_percentage",\
"settlement_term"],axis=1,inplace=True)
# Compute the pairwise correlation between columns
cor = data.corr()
# Keep only the lower triangle of the matrix (strictly below the diagonal),
# so each pair of columns appears once; cor.iloc addresses it row by row
cor.iloc[:, :] = np.tril(cor, k=-1)
cor = cor.stack()  # stack all rows into one column (a MultiIndex Series)
# print(cor[cor > 0.95])  # show column pairs with correlation above 0.95
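The tril/stack trick can be seen end to end on a small frame with one perfectly correlated pair (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],   # y = 2*x, so correlation is exactly 1.0
    "z": [4, 3, 8, 1],
})

cor = df.corr()
# Zero out the diagonal and upper triangle so each pair is counted once
cor.iloc[:, :] = np.tril(cor, k=-1)
cor = cor.stack()           # MultiIndex Series: (row, col) -> correlation
high = cor[cor > 0.95]      # only the strongly correlated pairs survive
print(high.index.tolist())  # [('y', 'x')]
```

From each surviving pair one column is then dropped, which is what the batch `drop` below does on the real data.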
# For each highly correlated pair, drop one of the two columns
data.drop(["funded_amnt","funded_amnt_inv","out_prncp_inv","total_pymnt_inv","total_rec_prncp",\
"collection_recovery_fee","num_rev_tl_bal_gt_0","num_sats",\
"tot_hi_cred_lim","total_il_high_credit_limit"],axis=1,inplace=True)
# For object-dtype columns, count the distinct values; columns with too few or
# too many distinct values can be dropped
for col in data.select_dtypes(include=["object"]).columns:
    # print(len(data[col].unique()))
    print("col {} has {}".format(col, len(data[col].unique())))
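The same cardinality check on a toy frame, using `Series.nunique` as an equivalent shortcut (the frame and its columns are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "grade": ["A", "B", "A", "C"],           # a few distinct values
    "emp_title": ["t1", "t2", "t3", "t4"],   # a distinct value on every row
    "amount": [100, 200, 300, 400],          # numeric, so not object dtype
})

# Only object-dtype columns are candidates for dummy encoding, so only
# their cardinality matters here
counts = {col: df[col].nunique() for col in df.select_dtypes(include=["object"]).columns}
print(counts)  # {'grade': 3, 'emp_title': 4}
```

A column with one distinct value per row (like `emp_title`) would explode into as many dummy columns as there are rows, while a near-constant column carries no signal; both ends are dropped.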
# Drop the columns with too few or too many distinct values
data.drop([
"grade","home_ownership","verification_status","issue_d","pymnt_plan",
"hardship_flag","disbursement_method","debt_settlement_flag","earliest_cr_line","revol_util"],axis=1,inplace=True)
# Encode the target. For now we treat this as a binary classification problem,
# so only "Fully Paid" and "Charged Off" are kept; everything else becomes NaN
data.loan_status.replace("Fully Paid", value=int(1), inplace=True)
data.loan_status.replace("Charged Off", value=int(0), inplace=True)
data.loan_status.replace("Current", value=np.nan, inplace=True)
data.loan_status.replace("Late (31-120 days)", value=np.nan, inplace=True)
data.loan_status.replace("In Grace Period", value=np.nan, inplace=True)
data.loan_status.replace("Late (16-30 days)", value=np.nan, inplace=True)
data.loan_status.replace("Default", value=np.nan, inplace=True)
data.dropna(subset=["loan_status"], axis=0, how="any", inplace=True)
# Fill the remaining np.nan values with 0.0
data.fillna(0.0, inplace=True)
# Dummy-encode the remaining categorical columns; below we use logistic
# regression to build the financial anti-fraud model
data = pd.get_dummies(data)
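What `pd.get_dummies` does to an object column, shown on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"term": [36, 60], "purpose": ["car", "house"]})

# Numeric columns pass through unchanged; each distinct value of an
# object column becomes its own 0/1 indicator column
dummies = pd.get_dummies(df)
print(dummies.columns.tolist())  # ['term', 'purpose_car', 'purpose_house']
```

This is also why high-cardinality text columns such as `emp_title` were dropped earlier: each distinct value would have become a column of its own.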
# index=False keeps the row index from being written out and later read back
# in as a spurious feature column
data.to_csv(r"g:\data\loanstats_2016q2\loanstats_2016q2_3.csv", index=False)
path = r"g:\data\loanstats_2016q2\loanstats_2016q2_3.csv"
data = pd.read_csv(path)
y = data.loan_status
x = data.drop("loan_status", axis=1, inplace=False)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)
lr = LogisticRegression()
lr.fit(x_train,y_train)
test_predict = lr.predict(x_test)
# Note: sklearn metrics take (y_true, y_pred) in that order
print(metrics.accuracy_score(y_test, test_predict))
print(metrics.recall_score(y_test, test_predict))
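One pitfall worth spelling out: sklearn metrics expect `(y_true, y_pred)` in that order. Accuracy is symmetric in its arguments, but recall is not, as a toy example shows:

```python
from sklearn import metrics

y_true = [1, 1, 1, 0]  # three actual positives
y_pred = [1, 0, 0, 0]  # only one of them was caught

# Correct order: recall = TP / (TP + FN) = 1/3
r_correct = metrics.recall_score(y_true, y_pred)
# Swapped order silently computes a different quantity (precision here) = 1/1
r_swapped = metrics.recall_score(y_pred, y_true)
print(r_correct, r_swapped)
```

In a fraud setting recall on the "Charged Off" class is usually the number that matters, since a missed bad loan is far more costly than a rejected good one.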
Financial Anti-Fraud Model: A Hands-On Machine Learning Project
This article walks through a complete, hands-on example of the full machine learning workflow: cleaning the source data, building features, training a model, and then validating and evaluating it. Machine learning can be confusing when first encountered, and the hope is that this worked example gives readers an initial feel for the whole process.