Notes: Credit Card Fraud, Simple Operations on Clean Data

# Anything you have not actually learned is a sunk cost

# Not reviewing in time only wastes more time later

I went through this in December, and here I am in February reviewing it again!

1. Inspect the data. We generally expect fraud cases to be the minority class.

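import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# `data` is assumed to be the transactions DataFrame, already loaded
# Count the records in each class; sort_index() orders the bars by class label
count_class = data['class'].value_counts(sort=True).sort_index()
count_class.plot(kind='bar')
plt.title('fraud class histogram')
plt.xlabel('class')
plt.ylabel('frequency')
plt.show()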
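To make the "fraud is the minority" point concrete, here is a minimal sketch that prints the counts and the relative share per class. The `toy` frame and its values are made up for illustration; only `value_counts` and `sort_index` come from the notes above.

```python
import pandas as pd

# Hypothetical toy data standing in for the real transactions table
toy = pd.DataFrame({"class": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

# Absolute counts per class, ordered by class label
class_counts = toy["class"].value_counts().sort_index()

# Relative frequencies make the imbalance easier to read than raw counts
class_share = toy["class"].value_counts(normalize=True).sort_index()

print(class_counts.to_dict())
print(class_share.to_dict())
```

On the real data the share of class 1 should come out far smaller than in this toy frame, which is what motivates the undersampling step later.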

2. Normalize (standardize) the data

from sklearn.preprocessing import StandardScaler

# reshape(-1, 1): -1 tells NumPy to infer the number of rows from the data; the column count is fixed at 1
data['normamount'] = StandardScaler().fit_transform(data['amount'].values.reshape(-1, 1)).ravel()
data = data.drop(['time', 'amount'], axis=1)
data.head()
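A small sketch of what the scaling step does, under the assumption that plain standardization (subtract the mean, divide by the standard deviation) is all that is wanted. The `amounts` values are invented; the comparison against the manual formula is just a sanity check.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction amounts
amounts = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# StandardScaler expects a 2D array of shape (n_samples, n_features)
column = amounts.reshape(-1, 1)
scaled = StandardScaler().fit_transform(column)

# The same computation by hand: subtract the mean, divide by the population std
manual = (amounts - amounts.mean()) / amounts.std()

print(column.shape)
print(np.allclose(scaled.ravel(), manual))
```

After scaling, the column has mean 0 and unit variance, so the amount no longer dominates the other features just because of its units.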

3. Undersampling

# Undersampling
x = data.loc[:, data.columns != 'class']
y = data.loc[:, data.columns == 'class']

# Number of data points in the minority (fraud) class
number_records_fraud = len(data[data['class'] == 1])
fraud_indices = np.array(data[data['class'] == 1].index)

# Indices of the normal class
normal_indices = data[data['class'] == 0].index

# Randomly pick as many normal records as there are fraud records
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)

# Combine the fraud indices with the sampled normal indices (remember this step)
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

# Undersampled dataset; remember this pattern for extracting x and y
# (.loc, because the indices are index labels rather than positions)
under_sample_data = data.loc[under_sample_indices, :]
x_undersample = under_sample_data.loc[:, under_sample_data.columns != 'class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'class']

# Show the class ratio
print("percentage of normal transactions:", len(under_sample_data[under_sample_data['class'] == 0]) / len(under_sample_data))
print("percentage of fraud transactions:", len(under_sample_data[under_sample_data['class'] == 1]) / len(under_sample_data))
print("total number of transactions in resampled data:", len(under_sample_data))

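The undersampling steps can be wrapped into one reusable function. This is a minimal sketch, not the original author's code: the name `undersample`, its `label_col` and `seed` parameters, and the toy frame are all hypothetical, and it assumes class 1 is the minority.

```python
import numpy as np
import pandas as pd

def undersample(df, label_col="class", seed=0):
    """Keep every minority (label 1) row plus an equally sized random sample of label 0 rows."""
    rng = np.random.default_rng(seed)
    fraud_idx = df.index[df[label_col] == 1].to_numpy()
    normal_idx = df.index[df[label_col] == 0].to_numpy()
    # Sample without replacement so no normal row is picked twice
    sampled_normal_idx = rng.choice(normal_idx, size=len(fraud_idx), replace=False)
    keep = np.concatenate([fraud_idx, sampled_normal_idx])
    # .loc because `keep` holds index labels, not positions
    return df.loc[keep]

# Hypothetical toy data: 5 fraud rows and 95 normal rows
toy = pd.DataFrame({"v1": np.arange(100, dtype=float), "class": [1] * 5 + [0] * 95})
balanced = undersample(toy)
print(balanced["class"].value_counts().to_dict())
```

Passing a fixed `seed` makes the sampled normal rows reproducible, which helps when comparing runs.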
3. Split into training and test sets

from sklearn.model_selection import train_test_split

# Whole dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
print("number transactions train dataset:", len(x_train))
print("number transactions test dataset:", len(x_test))
# Compare the total with the size of the original data
print("total number of transactions:", len(x_train) + len(x_test))

# Undersampled dataset
x_train_undersample, x_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    x_undersample, y_undersample, test_size=0.3, random_state=0)
print("*************")
print("number transactions train dataset:", len(x_train_undersample))
print("number transactions test dataset:", len(x_test_undersample))
print("total number of transactions:", len(x_train_undersample) + len(x_test_undersample))
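A quick sketch of what `test_size=0.3` and `random_state=0` do, using made-up arrays (`x_toy`, `y_toy` are placeholders, not the real features): the test part gets 30% of the rows, and repeating the call with the same `random_state` returns the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, 10 positive labels
x_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 10 + [0] * 90)

x_tr, x_te, y_tr, y_te = train_test_split(x_toy, y_toy, test_size=0.3, random_state=0)

# Same random_state, so the second call reproduces the first split
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(x_toy, y_toy, test_size=0.3, random_state=0)

print(len(x_tr), len(x_te))
print(np.array_equal(x_te, x_te2))
```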

4. Start validation

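from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, recall_score, classification_report

'''
Cross-validation with KFold
fold = KFold(n_splits=number of folds (default 5), shuffle=whether to shuffle before splitting (default False))  # instantiate
fold.split(data) is iterable; enumerate it to get the round number and the index arrays:
for iteration, indices in enumerate(fold.split(x_train_data), start=1):
    pass
iteration is the number of the current cross-validation round
indices holds the two parts (training indices, validation indices) the data is divided into for that round
'''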
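To see what `iteration` and `indices` look like in practice, here is a minimal sketch using `KFold` from `sklearn.model_selection` on ten made-up samples. The `samples` array and the `rounds` list are only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples, identified by their positions 0..9
samples = np.arange(10)
fold = KFold(n_splits=5, shuffle=False)

rounds = []
for iteration, (train_idx, valid_idx) in enumerate(fold.split(samples), start=1):
    # Each round yields two index arrays: training positions and validation positions
    rounds.append((iteration, train_idx.tolist(), valid_idx.tolist()))
    print(iteration, train_idx, valid_idx)
```

Without shuffling, the validation part is a consecutive block that moves through the data, and every sample lands in the validation part exactly once.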

def printing_kfold_scores(x_train_data,y_train_data):
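The notes stop at the function signature, so the body is not known. As a rough sketch of what a function along these lines could do, the version below computes the mean validation recall for several regularization strengths. Everything in it is an assumption for illustration: the name `kfold_recall_scores`, the `c_values` and `n_splits` parameters, the choice of recall as the metric, and the synthetic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

def kfold_recall_scores(x_train_data, y_train_data, c_values=(0.01, 0.1, 1, 10), n_splits=5):
    """Hypothetical helper: mean validation recall for each regularization strength C."""
    x_arr = np.asarray(x_train_data)
    y_arr = np.asarray(y_train_data).ravel()
    # Shuffle so that each fold contains both classes
    fold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    mean_recall = {}
    for c in c_values:
        recalls = []
        for iteration, (train_idx, valid_idx) in enumerate(fold.split(x_arr), start=1):
            model = LogisticRegression(C=c, max_iter=1000)
            model.fit(x_arr[train_idx], y_arr[train_idx])
            predictions = model.predict(x_arr[valid_idx])
            recalls.append(recall_score(y_arr[valid_idx], predictions))
        mean_recall[c] = float(np.mean(recalls))
        print("C =", c, "mean recall =", mean_recall[c])
    return mean_recall

# Synthetic, balanced two-class data standing in for the undersampled training set
rng = np.random.default_rng(0)
x_demo = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(3, 1, (100, 3))])
y_demo = np.array([0] * 100 + [1] * 100)
scores = kfold_recall_scores(x_demo, y_demo)
```

Recall is a natural metric for fraud detection because a missed fraud case usually costs more than a false alarm, but other metrics could be swapped in the same loop.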
