
Kaggle Classic Case: The Complete Credit Card Fraud Detection Workflow (Study Notes)

Published: 2023/12/20


This post walks through the complete workflow of this case and the concepts it involves.

First, take a look at the data.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline  # Jupyter magic: render plots inside the notebook

data = pd.read_csv("creditcard.csv")
data.head()
data.shape

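That is what the data looks like. Briefly: V1-V28 are a series of anonymous indicators (we do not need to know what they mean), Amount is the transaction amount, Class = 0 marks a normal transaction, and Class = 1 marks a fraudulent one.

The goal is clear: detect whether a transaction is fraudulent. That makes this a binary classification problem, so logistic regression is a natural first model.

1. Inspecting the data

We will call the Class = 0 rows negative samples and the Class = 1 rows positive samples. Let's count how many of each there are.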

# Series.value_counts() replaces the deprecated top-level pd.value_counts()
count_classes = data['Class'].value_counts().sort_index()
plt.figure(figsize=(10, 6))
count_classes.plot(kind='bar')
plt.title("Fraud class histogram")
plt.xlabel("Class", size=20)
plt.xticks(rotation=0)
plt.ylabel("Number", size=20)

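The plot shows that the classes are severely imbalanced. With so few samples in the minority class, the model sees too little of it to learn its patterns reliably. The model also overfits to the skewed class distribution: put simply, if you apply what it learned to a dataset with balanced classes, the fit will be poor.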
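To make the danger of imbalance concrete, here is a minimal sketch with made-up labels (990 normal, 10 fraud; these counts are hypothetical, not taken from creditcard.csv). A "model" that always predicts the majority class scores very high accuracy while catching no fraud at all, which is why accuracy alone is a misleading metric here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 normal transactions (0) and 10 frauds (1).
y_true = np.array([0] * 990 + [1] * 10)

# A trivial "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)  # fraction of correct predictions
recall = recall_score(y_true, y_pred)      # fraction of frauds that were caught

print("accuracy:", accuracy)
print("recall:", recall)
```

Accuracy comes out at 0.99 even though recall is 0, so the notes below evaluate models by recall instead.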
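So how do we deal with this? There are two approaches.

Choosing a sampling method

(1) Undersampling

For this problem, undersampling means keeping only a subset of the negative (normal) samples so that the negative and positive classes end up roughly the same size. In other words, both classes become equally small.

(2) Oversampling

Conversely, oversampling generates additional positive (fraud) samples until there are as many positive samples as negative ones. In other words, both classes become equally large.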
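The two strategies can be sketched on a toy DataFrame. The data below is hypothetical (95 normal rows, 5 fraud rows), and the oversampling shown is plain random duplication with replacement, which is only one simple way of producing more minority samples and is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: 95 normal rows (Class=0) and 5 fraud rows (Class=1).
toy = pd.DataFrame({
    "feature": rng.normal(size=100),
    "Class": [0] * 95 + [1] * 5,
})
fraud = toy[toy.Class == 1]
normal = toy[toy.Class == 0]

# Undersampling: keep only as many normal rows as there are fraud rows.
under = pd.concat([fraud, normal.sample(n=len(fraud), random_state=0)])

# Random oversampling: draw fraud rows with replacement until they match the normal rows.
over = pd.concat([normal, fraud.sample(n=len(normal), replace=True, random_state=0)])

under_counts = under.Class.value_counts().to_dict()
over_counts = over.Class.value_counts().to_dict()
print("undersampled:", under_counts)
print("oversampled:", over_counts)
```

Undersampling throws information away but keeps only real rows; random oversampling keeps everything but repeats the same few fraud rows many times.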

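2. Normalization

Looking further at the data, the Amount column varies over a far wider range than V1-V28. Before modeling, the features should be on comparable scales, otherwise the large-valued feature can mislead the model. So we first normalize or standardize Amount, which is easy with sklearn: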
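As a quick check of what standardization does, the sketch below applies `StandardScaler` to a few made-up amounts (hypothetical values) and compares the result with the manual formula (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction amounts on a much larger scale than V1-V28.
amount = np.array([1.0, 20.0, 150.0, 2500.0]).reshape(-1, 1)

scaled = StandardScaler().fit_transform(amount)

# StandardScaler subtracts the column mean and divides by the population
# standard deviation (ddof=0), which is also NumPy's default for std().
manual = (amount - amount.mean()) / amount.std()

print(scaled.ravel())
print("matches manual formula:", np.allclose(scaled, manual))
```

After scaling, the column has mean 0 and standard deviation 1, which puts it on a scale comparable to the other features.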

# The Time column is dropped here as well, since it does not help with this problem
from sklearn.preprocessing import StandardScaler

data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
data.head()

3. Undersampling the data

X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']  # equivalently: y = pd.DataFrame(data.Class)

# Number of fraud (positive) records and their indices
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Indices of the normal (negative) records
normal_indices = data[data.Class == 0].index

# Randomly pick as many normal indices as there are fraud records;
# replace=False means sampling without replacement
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)  # convert to an array

# Combine the fraud indices with the randomly chosen normal indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

# Look the rows up by index (positions equal labels here because read_csv gives a default index)
under_sample_data = data.iloc[under_sample_indices, :]

# X_undersample and y_undersample are the undersampled features and labels
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

print("Share of normal (negative) samples: ", len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
print("Share of fraud (positive) samples: ", len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
print("Total number of samples: ", len(under_sample_data))
X_undersample.head(3)
y_undersample.head(3)

[Output figure: class proportions and total sample count of the undersampled data]

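Cross-validation

Split the dataset into train (training set) and test (test set), usually 80/20, then divide train into 3 equal parts.

(1) 1+2 ------> 3: build the model on parts 1 and 2, and use part 3 as the validation set
(2) 1+3 ------> 2: likewise, build the model on parts 1 and 3, and validate on part 2
(3) 2+3 ------> 1: build the model on parts 2 and 3, and validate on part 1
Why do this? With a single split, the score depends on luck: if the validation samples happen to be easy, the model looks better than it really is, and if they contain outliers, it looks worse. Rotating through the folds and averaging balances the two and gives a more reasonable estimate of how well the model fits.
Final evaluation: average the scores obtained on parts 3, 2 and 1.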
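The three-fold rotation above can be sketched with `KFold` on synthetic data. The dataset comes from `make_classification` and is purely illustrative; shuffling the folds with a fixed seed is a choice of this sketch, not something the notes specify.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

# Synthetic, roughly balanced data standing in for the training set.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

fold_recalls = []
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, val_idx in kfold.split(X_toy):
    # Fit on two parts, validate on the remaining part.
    model = LogisticRegression().fit(X_toy[train_idx], y_toy[train_idx])
    fold_recalls.append(recall_score(y_toy[val_idx], model.predict(X_toy[val_idx])))

mean_recall = np.mean(fold_recalls)
print("recall per fold:", fold_recalls)
print("mean recall:", mean_recall)
```

Each sample is used for validation exactly once, and the reported score is the mean over the three folds.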
The code is as follows:

# sklearn.cross_validation was deprecated and its contents moved to
# sklearn.model_selection, so import from sklearn.model_selection instead
from sklearn.model_selection import train_test_split

# Random split: random_state=0 fixes the random seed, and test_size is the
# test-set fraction (0.3 here, i.e. 70% training and 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print("Original data, training set: ", len(X_train))
print("Original data, test set: ", len(X_test))
print("Original data, total: ", len(X_train) + len(X_test))

# Split the undersampled data in the same way
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=0)

print("")
print("Undersampled data, training set: ", len(X_train_undersample))
print("Undersampled data, test set: ", len(X_test_undersample))
print("Undersampled data, total: ", len(X_train_undersample) + len(X_test_undersample))

# The model is evaluated by recall: Recall = TP / (TP + FN)
# TP (true positives), FP (false positives), FN (false negatives), TN (true negatives)
from sklearn.linear_model import LogisticRegression  # logistic regression model
from sklearn.model_selection import KFold, cross_val_score  # KFold sets the number of folds; cross_val_score scores a cross-validation run
from sklearn.metrics import confusion_matrix, recall_score, classification_report  # confusion_matrix builds the confusion matrix

For an explanation of recall, the linked article covers it very clearly.
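As a small worked example of the recall formula, the sketch below uses ten hand-written labels (hypothetical, chosen so the counts are easy to verify by eye) and checks that the value computed from the confusion matrix matches `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical ground truth and predictions: 4 frauds (1) and 6 normal transactions (0).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

# For binary labels [0, 1], ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_pred)

print("TN, FP, FN, TP =", tn, fp, fn, tp)
print("recall:", recall_manual, recall_sklearn)
```

Three of the four frauds are caught, so recall is 3 / (3 + 1) = 0.75; the single false positive does not affect recall at all.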

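Regularization penalty

Suppose two sets of weights, A and B, achieve the same recall, but the weights in A vary far more than those in B. Then A is more prone to overfitting (good results on the training set but poor results on the test set) than B. To steer training toward a model like B, we add a regularization penalty, so the objective becomes: loss function + regularization penalty.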
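There are two kinds of regularization penalty:
L1: the sum of the absolute values of the weights, $\lambda \sum_j |w_j|$

L2: the sum of the squared weights, $\frac{\lambda}{2} \sum_j w_j^2$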
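The effect of the penalty can be seen on two hand-picked weight vectors (hypothetical values). Both give the same score on the input below, but one concentrates all weight on a single feature while the other spreads it evenly. The sketch assumes the usual definitions, L1 = λ Σ|w| and L2 = (λ/2) Σ w², with λ = 1.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w_a = np.array([1.0, 0.0, 0.0, 0.0])      # all weight on one feature
w_b = np.array([0.25, 0.25, 0.25, 0.25])  # weight spread evenly


def l1_penalty(w, lam=1.0):
    """L1 penalty: lambda times the sum of absolute weights."""
    return lam * np.sum(np.abs(w))


def l2_penalty(w, lam=1.0):
    """L2 penalty: lambda/2 times the sum of squared weights."""
    return 0.5 * lam * np.sum(w ** 2)


print("scores:", x @ w_a, x @ w_b)                # identical: 1.0 and 1.0
print("L1:", l1_penalty(w_a), l1_penalty(w_b))    # 1.0 and 1.0
print("L2:", l2_penalty(w_a), l2_penalty(w_b))    # 0.5 and 0.125
```

The L2 penalty is four times smaller for the evenly spread weights, so adding it to the loss favors the smoother model. Note that in scikit-learn's `LogisticRegression`, the `C` parameter is the inverse of the regularization strength, so a smaller `C` means a stronger penalty.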

def printing_Kfold_scores(x_train_data, y_train_data):
    # The definition of `fold` is missing from the source; 5 unshuffled folds is an assumption
    fold = KFold(n_splits=5, shuffle=False)

    # Candidate values for the regularization parameter C
    c_param_range = [0.01, 0.1, 1, 10, 100]

    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    # Each split yields two index arrays: indices[0] for training, indices[1] for validation
    j = 0
    for c_param in c_param_range:  # find the best regularization setting
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        for iteration, indices in enumerate(fold.split(y_train_data), start=1):
            # C is the inverse of the regularization strength (smaller C = stronger penalty);
            # penalty selects l1 or l2. Of the solvers 'liblinear', 'sag', 'saga',
            # 'newton-cg' and 'lbfgs', only 'liblinear' and 'saga' support l1.
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')

            # Train on the training folds: X as a DataFrame, y flattened to 1-D
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())

            # Predict on the validation fold
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :])

            # Compare the true labels with the predictions to get the recall
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    # Pick the C with the highest mean recall
    best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax(), 'C_parameter']

    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')
    return best_c

best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)

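I will skip the per-iteration output; if you are interested, copy the code and run it. The final result is the best C value printed at the end of the run.

Confusion matrix for the model trained on the undersampled data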

import itertools

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    # Draw the confusion matrix as a heat map and write each count in its cell
    plt.imshow(cm, interpolation='nearest', cmap=cmap, aspect='auto')
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Use white text on dark cells and black text on light cells
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

lr = LogisticRegression(C=best_c, penalty='l2')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample)

# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

# Plot the non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

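This is the model's result on the undersampled test set. Because of a matplotlib version issue, the numbers in my plot are slightly misaligned.
Still, I can read TP = 138 and FP = 9 off the plot; the other cells are hard to make out.
The recall is 0.863, which with TP = 138 corresponds to an FN of about 22, since Recall = TP / (TP + FN).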

Next, apply the model to the test set of the original data and plot the confusion matrix

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred = lr.predict(X_test)

# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

# Plot the non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

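The recall meets our needs, but there is still a problem. There are more than 8,000 false positives (FP), meaning that more than 8,000 normal transactions were flagged as fraud ("friendly fire"), which drags the precision down.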
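To see how strongly false positives hurt precision, here is a minimal sketch with hypothetical counts. Only the order of magnitude of FP (about 8,000) comes from the text above; the TP and FN values are invented for illustration.

```python
def precision(tp, fp):
    """Precision: of everything flagged as fraud, the share that really is fraud."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Recall: of all real frauds, the share that was flagged."""
    return tp / (tp + fn)


# Hypothetical counts: most frauds are caught, but thousands of normal
# transactions are flagged as well.
tp, fn, fp = 135, 12, 8000

print("recall:   ", round(recall(tp, fn), 4))
print("precision:", round(precision(tp, fp), 4))
```

With these numbers recall stays above 0.9 while precision falls below 0.02, so almost every alert would be a false alarm.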

4. Comparing undersampling with training directly on the original data

# Train on the original data and search for the best regularization parameter
best_c = printing_Kfold_scores(X_train, y_train)

lr = LogisticRegression(C=best_c, penalty='l2')
lr.fit(X_train, y_train.values.ravel())
y_pred = lr.predict(X_test)

# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

# Plot the non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

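The result is poor: the recall is very low. A model trained on imbalanced data without any treatment usually performs badly.

5. Effect of the logistic regression threshold on the results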

lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())

# predict_proba returns class probabilities instead of hard labels
y_pred_undersample_proba = lr.predict_proba(X_test_undersample)

# A range of thresholds to try
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

plt.figure(figsize=(12, 10))

j = 1
for i in thresholds:
    # Flag a sample as fraud when its predicted fraud probability exceeds the threshold
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i

    plt.subplot(3, 3, j)
    j += 1

    # Compute the confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

    # Plot the non-normalized confusion matrix
    # Top right: normal transactions flagged as fraud; bottom left: frauds that were missed
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold > %s' % i)

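By default, a sample is classified as fraud when its predicted probability exceeds 0.5. This threshold can be set freely; a larger threshold means a stricter criterion for flagging fraud.
The plots show how the threshold affects the results: as the threshold rises, recall decreases while precision gradually increases.

So the threshold should be chosen according to the actual requirements. A good model keeps both recall and precision as high as possible.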
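The trade-off can be reproduced on synthetic data. The sketch below assumes a dataset from `make_classification` (not the credit card data) and sweeps a few thresholds over the predicted fraud probability, printing recall and precision at each one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the undersampled set.
X_toy, y_toy = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# Probability of the positive class for every test sample.
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

recalls = []
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    y_hat = (proba > threshold).astype(int)
    r = recall_score(y_te, y_hat)
    p = precision_score(y_te, y_hat, zero_division=0)
    recalls.append(r)
    print(f"threshold={threshold:.1f}  recall={r:.3f}  precision={p:.3f}")
```

Raising the threshold can only shrink the set of samples flagged as positive, so recall never increases along the sweep; precision usually rises, although that is not guaranteed at every step.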
