Credit Card Fraud Detection with Machine Learning


Credit Card Fraud Detection with Machine Learning

  • 1. Credit Card Fraud Detection with Machine Learning
    • 1.1 Introduction
    • 1.2 Case Study
      • 1.2.1 Import the required modules into the Python environment
      • 1.2.2 Read the data and drop the unneeded Time column
      • 1.2.3 Exploratory data analysis and preprocessing
      • 1.2.4 Build six classification models
      • 1.2.5 Evaluate the models with evaluation metrics

1. Credit Card Fraud Detection with Machine Learning

1.1 Introduction

  • Data source: the Kaggle Credit Card Fraud Detection dataset, https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud?resource=download;
  • This article uses XGBoost, random forest, KNN, logistic regression, SVM, and decision tree models to tackle the credit card fraud detection problem.
1.2 Case Study

    1.2.1 Import the required modules into the Python environment

    # 1. Import the required modules into the Python environment
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from termcolor import colored as cl
    import itertools
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from xgboost import XGBClassifier
    from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

    1.2.2 Read the data and drop the unneeded Time column

    • About the data: we will use the Kaggle Credit Card Fraud Detection dataset. It contains features V1 through V28, which are the principal components obtained from PCA, and we drop the Time feature because it is not useful for building the model.
    • The remaining features are 'Amount', the total amount of the transaction, and 'Class', which indicates whether the transaction is a fraud case: class 0 means a normal transaction and class 1 means fraud (a quick check of this encoding follows the code and output below).
    df = pd.read_csv(r'../creditcard.csv')
    print("Data's columns contain:\n", df.columns)
    print("Data shape:\n", df.shape)
    df.drop('Time', axis=1, inplace=True)
    pd.set_option('display.max_columns', df.shape[1])
    print(df.head())

    '''
    Data's columns contain:
     Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
            'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
            'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
            'Class'],
           dtype='object')
    Data shape:
     (284807, 31)
             V1        V2        V3        V4        V5        V6        V7  \
    0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599
    1  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803
    2 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461
    3 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609
    4 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941
             V8        V9       V10       V11       V12       V13       V14  \
    0  0.098698  0.363787  0.090794 -0.551600 -0.617801 -0.991390 -0.311169
    1  0.085102 -0.255425 -0.166974  1.612727  1.065235  0.489095 -0.143772
    2  0.247676 -1.514654  0.207643  0.624501  0.066084  0.717293 -0.165946
    3  0.377436 -1.387024 -0.054952 -0.226487  0.178228  0.507757 -0.287924
    4 -0.270533  0.817739  0.753074 -0.822843  0.538196  1.345852 -1.119670
            V15       V16       V17       V18       V19       V20       V21  \
    0  1.468177 -0.470401  0.207971  0.025791  0.403993  0.251412 -0.018307
    1  0.635558  0.463917 -0.114805 -0.183361 -0.145783 -0.069083 -0.225775
    2  2.345865 -2.890083  1.109969 -0.121359 -2.261857  0.524980  0.247998
    3 -0.631418 -1.059647 -0.684093  1.965775 -1.232622 -0.208038 -0.108300
    4  0.175121 -0.451449 -0.237033 -0.038195  0.803487  0.408542 -0.009431
            V22       V23       V24       V25       V26       V27       V28  \
    0  0.277838 -0.110474  0.066928  0.128539 -0.189115  0.133558 -0.021053
    1 -0.638672  0.101288 -0.339846  0.167170  0.125895 -0.008983  0.014724
    2  0.771679  0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752
    3  0.005274 -0.190321 -1.175575  0.647376 -0.221929  0.062723  0.061458
    4  0.798278 -0.137458  0.141267 -0.206010  0.502292  0.219422  0.215153
       Amount  Class
    0  149.62      0
    1    2.69      0
    2  378.66      0
    3  123.50      0
    4   69.99      0
    '''
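    To confirm the Class encoding before going further, a minimal check (run in the same session as the code above) is to count the label values directly; the counts given in the comment are the ones reported in the EDA of section 1.2.3, so treat them as the expected rather than guaranteed output:

    # Sanity check of the Class encoding: 0 = normal, 1 = fraud.
    print(df['Class'].value_counts())
    # Expected (see section 1.2.3): 284315 transactions labelled 0, 492 labelled 1.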

    1.2.3 Exploratory data analysis and preprocessing

    cases = len(df)
    nonfraud_cases = df[df.Class == 0]   # non-fraud transactions
    fraud_cases = df[df.Class == 1]      # fraud transactions
    fraud_percentage = round(len(fraud_cases) / cases * 100, 2)

    print(cl('CASE COUNT', attrs=['bold']))
    print(cl('-' * 40, attrs=['bold']))
    print(cl('Total number of cases are {}'.format(cases), attrs=['bold']))
    print(cl('Number of Non-fraud cases are {}'.format(len(nonfraud_cases)), attrs=['bold']))
    print(cl('Number of fraud cases are {}'.format(len(fraud_cases)), attrs=['bold']))
    print(cl('Percentage of fraud cases is {}%'.format(fraud_percentage), attrs=['bold']))
    print(cl('-' * 40, attrs=['bold']))
    print(cl('CASE AMOUNT STATISTICS', attrs=['bold']))
    print(cl('-' * 40, attrs=['bold']))
    print(cl('NON-FRAUD CASE AMOUNT STATS', attrs=['bold']))
    print(nonfraud_cases.Amount.describe())
    print(cl('-' * 40, attrs=['bold']))
    print(cl('FRAUD CASE AMOUNT STATS', attrs=['bold']))
    print(fraud_cases.Amount.describe())
    print(cl('-' * 40, attrs=['bold']))

    # The 'Amount' values vary over a wide range, so standardize the column
    sc = StandardScaler()
    amount = df.Amount.values
    df.Amount = sc.fit_transform(amount.reshape(-1, 1))
    print(cl(df.Amount.head(10), attrs=['bold']))

    # Feature selection and train/test split
    x = df.drop('Class', axis=1).values
    y = df.Class.values
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

    '''
    CASE COUNT
    ----------------------------------------
    Total number of cases are 284807
    Number of Non-fraud cases are 284315
    Number of fraud cases are 492
    Percentage of fraud cases is 0.17%
    ----------------------------------------
    CASE AMOUNT STATISTICS
    ----------------------------------------
    NON-FRAUD CASE AMOUNT STATS
    count    284315.000000
    mean         88.291022
    std         250.105092
    min           0.000000
    25%           5.650000
    50%          22.000000
    75%          77.050000
    max       25691.160000
    Name: Amount, dtype: float64
    ----------------------------------------
    FRAUD CASE AMOUNT STATS
    count     492.000000
    mean      122.211321
    std       256.683288
    min         0.000000
    25%         1.000000
    50%         9.250000
    75%       105.890000
    max      2125.870000
    Name: Amount, dtype: float64
    ----------------------------------------
    0    0.244964
    1   -0.342475
    2    1.160686
    3    0.140534
    4   -0.073403
    5   -0.338556
    6   -0.333279
    7   -0.190107
    8    0.019392
    9   -0.338516
    Name: Amount, dtype: float64
    '''
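    As a quick worked check of the standardization step (a minimal sketch, assuming the file path and session state from the code above): StandardScaler maps each amount to (x - mean) / std over the full Amount column, so the first transaction's 149.62 should land close to the 0.244964 printed above.

    # Recompute the first standardized Amount by hand to verify StandardScaler.
    # The raw column is re-read because df.Amount has already been overwritten above.
    raw_amount = pd.read_csv(r'../creditcard.csv')['Amount']
    z0 = (149.62 - raw_amount.mean()) / raw_amount.std(ddof=0)  # StandardScaler uses the population std (ddof=0)
    print(round(z0, 6))  # expected to come out near 0.244964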

    1.2.4 Build six classification models

  • Decision Tree

    tree_model = DecisionTreeClassifier(max_depth=4, criterion='entropy').fit(x_train, y_train)
    tree_yhat = tree_model.predict(x_test)

  • K-Nearest Neighbors

    knn_model = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)
    knn_yhat = knn_model.predict(x_test)

  • Logistic Regression

    lr_model = LogisticRegression().fit(x_train, y_train)
    lr_yhat = lr_model.predict(x_test)

  • SVM

    svm_model = SVC().fit(x_train, y_train)
    svm_yhat = svm_model.predict(x_test)

  • Random Forest

    rf_model = RandomForestClassifier(max_depth=4).fit(x_train, y_train)
    rf_yhat = rf_model.predict(x_test)

  • XGBoost

    xgb_model = XGBClassifier(max_depth=4).fit(x_train, y_train)
    xgb_yhat = xgb_model.predict(x_test)

    A consolidated, loop-based version of these six fits is sketched right after this list.
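    The six fits above are written out one by one to keep each model visible. As a compact alternative (a sketch, not part of the original notebook, using the same estimators and hyperparameters), the models can also be trained and scored in a single loop:

    # Equivalent loop-based training: same estimators and hyperparameters as above.
    models = {
        'Decision Tree': DecisionTreeClassifier(max_depth=4, criterion='entropy'),
        'KNN': KNeighborsClassifier(n_neighbors=5),
        'Logistic Regression': LogisticRegression(),
        'SVM': SVC(),
        'Random Forest': RandomForestClassifier(max_depth=4),
        'XGBoost': XGBClassifier(max_depth=4),
    }
    predictions = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        predictions[name] = model.predict(x_test)   # e.g. predictions['XGBoost'] corresponds to xgb_yhat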

    1.2.5 Evaluate the models with evaluation metrics

  • Accuracy (a note on why accuracy alone is a weak signal here follows the output below)

    print(cl('-' * 40, attrs=['bold']))
    print(cl('ACCURACY SCORE', attrs=['bold']))
    print(cl('Accuracy score of the Decision Tree model is {}'.format(round(accuracy_score(y_test, tree_yhat), 4)), attrs=['bold']))
    print(cl('Accuracy score of the knn model is {}'.format(round(accuracy_score(y_test, knn_yhat), 4)), attrs=['bold']))
    print(cl('Accuracy score of the Logistic Regression model is {}'.format(round(accuracy_score(y_test, lr_yhat), 4)), attrs=['bold']))
    print(cl('Accuracy score of the SVM model is {}'.format(round(accuracy_score(y_test, svm_yhat), 4)), attrs=['bold']))
    print(cl('Accuracy score of the Random Forest model is {}'.format(round(accuracy_score(y_test, rf_yhat), 4)), attrs=['bold']))
    print(cl('Accuracy score of the XGBoost model is {}'.format(round(accuracy_score(y_test, xgb_yhat), 4)), attrs=['bold']))

    '''
    ACCURACY SCORE
    Accuracy score of the Decision Tree model is 0.9994
    Accuracy score of the knn model is 0.9995
    Accuracy score of the Logistic Regression model is 0.9992
    Accuracy score of the SVM model is 0.9993
    Accuracy score of the Random Forest model is 0.9993
    Accuracy score of the XGBoost model is 0.9995
    '''
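    With only 0.17% fraud cases, raw accuracy says very little here: a degenerate baseline that labels every transaction as non-fraud already scores about 284315 / 284807 ≈ 0.9983 on the full data, which is why the six near-identical accuracies above barely separate the models. A minimal illustration (a sketch reusing y_test from the split above):

    # Baseline: predict "non-fraud" (class 0) for every test transaction.
    baseline_yhat = np.zeros_like(y_test)
    print(round(accuracy_score(y_test, baseline_yhat), 4))  # roughly 0.998, despite catching zero fraud cases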
  • F1 score (a precision/recall sketch follows the output below)

    print(cl('-' * 40, attrs=['bold']))
    print(cl('F1 SCORE', attrs=['bold']))
    print(cl('F1 score of the Decision Tree model is {}'.format(round(f1_score(y_test, tree_yhat), 4)), attrs=['bold']))
    print(cl('F1 score of the knn model is {}'.format(round(f1_score(y_test, knn_yhat), 4)), attrs=['bold']))
    print(cl('F1 score of the Logistic Regression model is {}'.format(round(f1_score(y_test, lr_yhat), 4)), attrs=['bold']))
    print(cl('F1 score of the SVM model is {}'.format(round(f1_score(y_test, svm_yhat), 4)), attrs=['bold']))
    print(cl('F1 score of the Random Forest model is {}'.format(round(f1_score(y_test, rf_yhat), 4)), attrs=['bold']))
    print(cl('F1 score of the XGBoost model is {}'.format(round(f1_score(y_test, xgb_yhat), 4)), attrs=['bold']))

    '''
    F1 SCORE
    F1 score of the Decision Tree model is 0.8105
    F1 score of the knn model is 0.8571
    F1 score of the Logistic Regression model is 0.7356
    F1 score of the SVM model is 0.7771
    F1 score of the Random Forest model is 0.7657
    F1 score of the XGBoost model is 0.8449
    '''
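    The F1 score reported here is the harmonic mean of precision and recall on the fraud class, F1 = 2PR / (P + R), which is why it spreads the models out far more than accuracy does. If the two components are of interest, they can be printed the same way (a sketch using sklearn's precision_score and recall_score, which take the same (y_test, yhat) arguments as f1_score):

    from sklearn.metrics import precision_score, recall_score

    # Precision and recall on the fraud class (label 1), per model.
    for name, yhat in [('Decision Tree', tree_yhat), ('KNN', knn_yhat),
                       ('Logistic Regression', lr_yhat), ('SVM', svm_yhat),
                       ('Random Forest', rf_yhat), ('XGBoost', xgb_yhat)]:
        print(name,
              'precision:', round(precision_score(y_test, yhat), 4),
              'recall:', round(recall_score(y_test, yhat), 4))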
  • Confusion matrix

    def plot_confusion_matrix(cm, classes, title, cmap=plt.cm.Blues):
        title = 'Confusion Matrix of {}'.format(title)
        plt.imshow(cm, cmap=cmap)
        plt.title(title)
        plt.colorbar()
        marks = np.arange(len(classes))
        plt.xticks(marks, classes, rotation=45)
        plt.yticks(marks, classes)
        thresh = cm.max() / 2
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):  # Cartesian product of cell indices
            plt.text(j, i, format(cm[i, j], 'd'), horizontalalignment='center',
                     color='white' if cm[i, j] > thresh else 'black')
        """plt.text(x, y, string, fontsize=15, verticalalignment='top', horizontalalignment='right')
        places a text annotation on the plot.
        Parameters:
            x, y: coordinates of the annotation
            string: the annotation text
            fontsize: font size
            verticalalignment: vertical alignment, one of ['center', 'top', 'bottom', 'baseline']
            horizontalalignment: horizontal alignment, one of ['center', 'right', 'left']
        """
        plt.tight_layout()
        plt.ylabel('True label')
        plt.xlabel('Predict label')

    # Compute the confusion matrices
    tree_matrix = confusion_matrix(y_test, tree_yhat, labels=[0, 1])
    knn_matrix = confusion_matrix(y_test, knn_yhat, labels=[0, 1])
    lr_matrix = confusion_matrix(y_test, lr_yhat, labels=[0, 1])
    svm_matrix = confusion_matrix(y_test, svm_yhat, labels=[0, 1])
    rf_matrix = confusion_matrix(y_test, rf_yhat, labels=[0, 1])
    xgb_matrix = confusion_matrix(y_test, xgb_yhat, labels=[0, 1])

    # Customize the default figure properties via the rc configuration
    plt.rcParams['figure.figsize'] = (6, 6)
    classes = ['Non-fraud(0)', 'Fraud(1)']

    tree_cm_plot = plot_confusion_matrix(tree_matrix, classes=classes, title='Decision Tree')
    plt.savefig('tree_cm_plot.png')
    plt.show()

    In the plots, the horizontal axis is the predicted label and the vertical axis is the true label; the cell counts can also be read off programmatically, as sketched below.
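    For a 2x2 matrix computed with labels=[0, 1], the cells are laid out as [[TN, FP], [FN, TP]]: row = true label, column = predicted label. A minimal sketch using the tree_matrix computed above:

    # Unpack the Decision Tree confusion matrix into its four cells.
    tn, fp, fn, tp = tree_matrix.ravel()
    print('True negatives :', tn)   # non-fraud predicted as non-fraud
    print('False positives:', fp)   # non-fraud flagged as fraud
    print('False negatives:', fn)   # fraud missed by the model
    print('True positives :', tp)   # fraud correctly detected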

    knn_cm_plot = plot_confusion_matrix(knn_matrix, classes=classes, title='KNN')
    plt.savefig('knn_cm_plot.png')
    plt.show()

    lr_cm_plot = plot_confusion_matrix(lr_matrix, classes=classes, title='Logistic Regression')
    plt.savefig('lr_cm_plot.png')
    plt.show()

    svm_cm_plot = plot_confusion_matrix(svm_matrix, classes=classes, title='SVM')
    plt.savefig('svm_cm_plot.png')
    plt.show()

    rf_cm_plot = plot_confusion_matrix(rf_matrix, classes=classes, title='Random Forest')
    plt.savefig('rf_cm_plot.png')
    plt.show()

    xgb_cm_plot = plot_confusion_matrix(xgb_matrix, classes=classes, title='XGBoost')
    plt.savefig('xgb_cm_plot.png')
    plt.show()
