當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

[Kesci] 预测分析 · 客户购买预测（AUC评估要使用predict_proba）

發布時間：2024/7/5 编程问答 35 豆豆

生活随笔收集整理的這篇文章主要介紹了 [Kesci] 预测分析 · 客户购买预测（AUC评估要使用predict_proba）小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

文章目錄

- 1. Baseline
- 2. AUC評估要使用predict_proba
- - 2.1 導入工具包
  - 2.2 特征提取
  - 2.3 訓練+模型選擇
  - 2.4 網格/隨機搜索參數+提交
  - 2.5 測試結果
- 3. 致謝

新人賽地址

1. Baseline

import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt %matplotlib inline import seaborn as sns from sklearn.model_selection import train_test_split from sklearn.model_selection import StratifiedShuffleSplit from sklearn.impute import SimpleImputer from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import OneHotEncoder from sklearn.preprocessing import LabelBinarizer from sklearn.base import BaseEstimator, TransformerMixin from sklearn.pipeline import Pipeline from sklearn.preprocessing import StandardScaler from sklearn.pipeline import FeatureUnion from sklearn.model_selection import GridSearchCV from sklearn.model_selection import cross_val_scoretrain = pd.read_csv("./train_set.csv") test = pd.read_csv("./test_set.csv") train.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 25317 entries, 0 to 25316 Data columns (total 18 columns):# Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 25317 non-null int64 1 age 25317 non-null int64 2 job 25317 non-null object3 marital 25317 non-null object4 education 25317 non-null object5 default 25317 non-null object6 balance 25317 non-null int64 7 housing 25317 non-null object8 loan 25317 non-null object9 contact 25317 non-null object10 day 25317 non-null int64 11 month 25317 non-null object12 duration 25317 non-null int64 13 campaign 25317 non-null int64 14 pdays 25317 non-null int64 15 previous 25317 non-null int64 16 poutcome 25317 non-null object17 y 25317 non-null int64 dtypes: int64(9), object(9) memory usage: 3.5+ MB NO字段名稱數據類型字段描述

1	ID	Int	客戶唯一標識
2	age	Int	客戶年齡
3	job	String	客戶的職業
4	marital	String	婚姻狀況
5	education	String	受教育水平
6	default	String	是否有違約記錄
7	balance	Int	每年賬戶的平均余額
8	housing	String	是否有住房貸款
9	loan	String	是否有個人貸款
10	contact	String	與客戶聯系的溝通方式
11	day	Int	最后一次聯系的時間（幾號）
12	month	String	最后一次聯系的時間（月份）
13	duration	Int	最后一次聯系的交流時長
14	campaign	Int	在本次活動中，與該客戶交流過的次數
15	pdays	Int	距離上次活動最后一次聯系該客戶，過去了多久（999表示沒有聯系過）
16	previous	Int	在本次活動之前，與該客戶交流過的次數
17	poutcome	String	上一次活動的結果
18	y	Int	預測客戶是否會訂購定期存款業務

相關系數

abs(train.corr()['y']).sort_values(ascending=False) y 1.000000 ID 0.556627 duration 0.394746 pdays 0.107565 previous 0.088337 campaign 0.075173 balance 0.057564 day 0.031886 age 0.029916 Name: y, dtype: float64

繪制數字特征分布圖

s = (train.dtypes == 'object') object_col = list(s[s].index) object_col num_col = list(set(train.columns) - set(object_col))plt.figure(figsize=(25,22)) for (i,col) in enumerate(num_col):plt.subplot(3,3,i+1)sns.distplot(train[col]) # kde=False 可不顯示密度線plt.xlabel(col,size=20) plt.show()

分析下訓練集 y 標簽的比例

len(train[train['y']==1])/len(train['y']) 0.11695698542481336

只有 11% 的人會購買

新人賽，數據沒有缺失的，直接用模型試試效果

X_train = train.drop(['ID','y'], axis=1) X_test = test.drop(['ID'], axis=1) y_train = train['y'] def num_cat_splitor(X_train):s = (X_train.dtypes == 'object')object_cols = list(s[s].index)num_cols = list(set(X_train.columns) - set(object_cols))return num_cols, object_cols num_cols, object_cols = num_cat_splitor(X_train) # 查看文字變量的種類 for col in object_col:print(col, sorted(train[col].unique()))print(col, sorted(test[col].unique())) class DataFrameSelector(BaseEstimator, TransformerMixin):def __init__(self, attribute_names):self.attribute_names = attribute_namesdef fit(self, X, y=None):return selfdef transform(self, X):return X[self.attribute_names].valuesnum_pipeline = Pipeline([('selector', DataFrameSelector(num_cols)),#('imputer', SimpleImputer(strategy="median")),('std_scaler', StandardScaler()),]) cat_pipeline = Pipeline([('selector', DataFrameSelector(object_cols)),('cat_encoder', OneHotEncoder(sparse=False,handle_unknown='ignore')),]) full_pipeline = FeatureUnion(transformer_list=[("num_pipeline", num_pipeline),("cat_pipeline", cat_pipeline),]) X_prepared = full_pipeline.fit_transform(X_train) from sklearn.ensemble import RandomForestClassifierprepare_select_and_predict_pipeline = Pipeline([('preparation', full_pipeline),('forst_reg', RandomForestClassifier(random_state=0)) ]) param_grid = [{'forst_reg__n_estimators' : [50,100, 150, 200,250,300,330,350],'forst_reg__max_features':[45,50, 55, 65] }]grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=7,scoring='roc_auc', verbose=2, n_jobs=-1) grid_search_prep.fit(X_train,y_train) grid_search_prep.best_params_ final_model = grid_search_prep.best_estimator_ y_pred_test = final_model.predict(X_test) # AUC 評估準則，需要使用 predict_proba，這里錯誤！！！ result = pd.DataFrame() result['ID'] = test['ID'] result['pred'] = y_pred_test result.to_csv('buy_product_pred.csv',index=False)

排名結果

auc 得分：0.72439844

2. AUC評估要使用predict_proba

AUC 指標，在預測時，應該使用概率來預測，上面做法是錯誤的（未使用概率預測）。

機器學習之分類器性能指標之ROC曲線、AUC值 https://www.cnblogs.com/dlml/p/4403482.html
如何理解機器學習和統計中的AUC？ https://www.zhihu.com/question/39840928
sklearn.metrics.roc_auc_score 介紹

AUC 評估模型的優點，在模型正負樣本比例失衡的情況下，依然可以很好的評估模型

以下重新對代碼進行優化

2.1 導入工具包

import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt %matplotlib inline plt.rcParams['figure.facecolor']=(1,1,1,1) # pycharm 繪圖白底，看得清坐標 from sklearn.model_selection import train_test_split from sklearn.model_selection import StratifiedShuffleSplit from sklearn.impute import SimpleImputer from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import OneHotEncoder from sklearn.preprocessing import LabelBinarizer from sklearn.base import BaseEstimator, TransformerMixin from sklearn.pipeline import Pipeline from sklearn.preprocessing import StandardScaler from sklearn.pipeline import FeatureUnion from sklearn.model_selection import GridSearchCV from sklearn.model_selection import cross_val_scoretrain = pd.read_csv("./train_set.csv") test = pd.read_csv("./test_set.csv")

2.2 特征提取

查看文字特征的值

for col in object_col:print(col, sorted(train[col].unique()))print(col, sorted(test[col].unique())) job ['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown'] job ['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown'] marital ['divorced', 'married', 'single'] marital ['divorced', 'married', 'single'] education ['primary', 'secondary', 'tertiary', 'unknown'] education ['primary', 'secondary', 'tertiary', 'unknown'] default ['no', 'yes'] default ['no', 'yes'] housing ['no', 'yes'] housing ['no', 'yes'] loan ['no', 'yes'] loan ['no', 'yes'] contact ['cellular', 'telephone', 'unknown'] contact ['cellular', 'telephone', 'unknown'] month ['apr', 'aug', 'dec', 'feb', 'jan', 'jul', 'jun', 'mar', 'may', 'nov', 'oct', 'sep'] month ['apr', 'aug', 'dec', 'feb', 'jan', 'jul', 'jun', 'mar', 'may', 'nov', 'oct', 'sep'] poutcome ['failure', 'other', 'success', 'unknown'] poutcome ['failure', 'other', 'success', 'unknown']

二值特征轉化為 0， 1

# 對 'default','housing','loan' 3列二值(yes,no)特征轉為 0，1 def binaryFeature(data):data['default_']=0data['default_'][data['default']=='yes'] = 1data['housing_']=0data['housing_'][data['housing']=='yes'] = 1data['loan_']=0data['loan_'][data['loan']=='yes'] = 1return data.drop(['default','housing','loan'], axis=1)X_train = binaryFeature(train) X_test = binaryFeature(test)

訓練集數據切分，用于本地測試

X_train = X_train.drop(['ID'], axis=1) X_test = X_test.drop(['ID'], axis=1)# 將訓練集拆分一些出來做驗證, 分層抽樣 from sklearn.model_selection import StratifiedShuffleSplit splt = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1) for train_idx, vaild_idx in splt.split(X_train, X_train['y']):train_part = X_train.loc[train_idx]valid_part = X_train.loc[vaild_idx]# 訓練集拆成兩部分本地測試 train_part_y = train_part['y'] valid_part_y = valid_part['y'] train_part = train_part.drop(['y'], axis=1) valid_part = valid_part.drop(['y'], axis=1)

特征處理管道

def num_cat_splitor(X_train):s = (X_train.dtypes == 'object')object_cols = list(s[s].index)num_cols = list(set(X_train.columns) - set(object_cols))return num_cols, object_colsnum_cols, object_cols = num_cat_splitor(X_train) num_cols.remove('y')class DataFrameSelector(BaseEstimator, TransformerMixin):def __init__(self, attribute_names):self.attribute_names = attribute_namesdef fit(self, X, y=None):return selfdef transform(self, X):return X[self.attribute_names].valuesnum_pipeline = Pipeline([('selector', DataFrameSelector(num_cols)), # ('imputer', SimpleImputer(strategy="median")), # ('std_scaler', StandardScaler()),]) cat_pipeline = Pipeline([('selector', DataFrameSelector(object_cols)),('cat_encoder', OneHotEncoder(sparse=False, handle_unknown='ignore')),]) full_pipeline = FeatureUnion(transformer_list=[("num_pipeline", num_pipeline),("cat_pipeline", cat_pipeline),])

2.3 訓練+模型選擇

# 本地測試，選模型 from sklearn.ensemble import GradientBoostingClassifier from sklearn.ensemble import RandomForestClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.linear_model import LogisticRegression from sklearn.svm import SVC from sklearn.metrics import roc_auc_scorerf = RandomForestClassifier() knn = KNeighborsClassifier() lr = LogisticRegression() svc = SVC(probability=True) gbdt = GradientBoostingClassifier()models = [knn, lr, svc, rf, gbdt] param_grid_list = [# knn[{'model__n_neighbors' : [5,15,35,50,100],'model__leaf_size' : [10,20,30,40,50]}],# lr[{'model__penalty' : ['l1', 'l2'],'model__C' : [0.2, 0.5, 1, 1.2, 1.5],'model__max_iter' : [10000]}],# svc[{'model__C' : [0.2, 0.5, 1, 1.2],'model__kernel' : ['rbf']}],# rf[{# 'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],'model__n_estimators' : [200,250,300,330,350],'model__max_features' : [20,30,40,50],'model__max_depth' : [5,7]}],# gbdt[{'model__learning_rate' : [0.1, 0.5],'model__n_estimators' : [130, 200, 300],'model__max_features' : ['sqrt'],'model__max_depth' : [5,7],'model__min_samples_split' : [500,1000,1200],'model__min_samples_leaf' : [60, 100],'model__subsample' : [0.8, 1]}], ]for i, model in enumerate(models):pipe = Pipeline([('preparation', full_pipeline),('model', model)])grid_search = GridSearchCV(pipe, param_grid_list[i], cv=3,scoring='roc_auc', verbose=2, n_jobs=-1)grid_search.fit(train_part, train_part_y)print(grid_search.best_params_)final_model = grid_search.best_estimator_pred = final_model.predict_proba(valid_part)[:,1] # roc 必須使用概率預測print("auc score: ", roc_auc_score(valid_part_y, pred))

注意 AUC 評分標準要使用predict_proba方法！！！

Fitting 3 folds for each of 25 candidates, totalling 75 fits {'model__leaf_size': 20, 'model__n_neighbors': 50} auc score: 0.8212256518034133 Fitting 3 folds for each of 10 candidates, totalling 30 fits {'model__C': 1.2, 'model__max_iter': 10000, 'model__penalty': 'l2'} auc score: 0.9011510812019533 Fitting 3 folds for each of 4 candidates, totalling 12 fits {'model__C': 0.2, 'model__kernel': 'rbf'} auc score: 0.7192431208601267 Fitting 3 folds for each of 40 candidates, totalling 120 fits {'model__max_depth': 7, 'model__max_features': 20, 'model__n_estimators': 350} auc score: 0.913398647137746 Fitting 3 folds for each of 144 candidates, totalling 432 fits {'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 60, 'model__min_samples_split': 500, 'model__n_estimators': 300, 'model__subsample': 1} auc score: 0.9299485084368806

可以看見 GBDT 梯度提升下降樹模型表現最好

2.4 網格/隨機搜索參數+提交

微調參數列表，使用全部的訓練數據訓練，使用 RF 和 GBDT 模型對測試集進行預測

網格搜索

# 全量訓練，網格搜索，提交 y_train = X_train['y'] X_train_ = X_train.drop(['y'], axis=1)select_model = [rf, gbdt] param_grid_list = [# rf[{# 'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],'model__n_estimators' : [250,300,350,400],'model__max_features' : [7,8,10,15,20],'model__max_depth' : [7,9,10,11]}],# gbdt[{'model__learning_rate' : [0.03, 0.05, 0.1],'model__n_estimators' : [200, 300, 350],'model__max_features' : ['sqrt'],'model__max_depth' : [7,9,11],'model__min_samples_split' : [300, 400, 500],'model__min_samples_leaf' : [50,60,70],'model__subsample' : [0.8, 1, 1.2]}], ]for i, model in enumerate(select_model):pipe = Pipeline([('preparation', full_pipeline),('model', model)])grid_search = GridSearchCV(pipe, param_grid_list[i], cv=3,scoring='roc_auc', verbose=2, n_jobs=-1)grid_search.fit(X_train_, y_train)print(grid_search.best_params_)final_model = grid_search.best_estimator_pred = final_model.predict_proba(X_test)[:,1] # roc 必須使用概率預測print(model,'\n finished!')result = pd.DataFrame()result['ID'] = test['ID']result['pred'] = predresult.to_csv('{}_pred.csv'.format(i), index=False) Fitting 3 folds for each of 80 candidates, totalling 240 fits {'model__max_depth': 11, 'model__max_features': 15, 'model__n_estimators': 400} RandomForestClassifier() finished! Fitting 3 folds for each of 729 candidates, totalling 2187 fits {'model__learning_rate': 0.05, 'model__max_depth': 11, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 50, 'model__min_samples_split': 500, 'model__n_estimators': 300, 'model__subsample': 1} GradientBoostingClassifier() finished!

隨機搜索

# 隨機搜索參數 y_train = X_train['y'] X_train_ = X_train.drop(['y'], axis=1)from sklearn.model_selection import RandomizedSearchCV from scipy.stats import randint select_model = [rf, gbdt] param_distribs = [# rf[{# 'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],'model__n_estimators' : randint(low=250, high=500),'model__max_features' : randint(low=10, high=30),'model__max_depth' : randint(low=8, high=20)}],# gbdt[{'model__learning_rate' : np.linspace(0.01, 0.1, 10),'model__n_estimators' : randint(low=250, high=500),'model__max_features' : ['sqrt'],'model__max_depth' : randint(low=8, high=20),'model__min_samples_split' : randint(low=400, high=1000),'model__min_samples_leaf' : randint(low=40, high=80),'model__subsample' : np.linspace(0.5, 1.5, 10)}], ]for i, model in enumerate(select_model):pipe = Pipeline([('preparation', full_pipeline),('model', model)])rand_search = RandomizedSearchCV(pipe, param_distributions=param_distribs[i], cv=3,n_iter=20,scoring='roc_auc', verbose=2, n_jobs=-1)rand_search.fit(X_train_, y_train)print(rand_search.best_params_)final_model = rand_search.best_estimator_pred = final_model.predict_proba(X_test)[:,1] # roc 必須使用概率預測print(model,'\n finished!')result = pd.DataFrame()result['ID'] = test['ID']result['pred'] = predresult.to_csv('{}_pred.csv'.format(i), index=False) Fitting 3 folds for each of 20 candidates, totalling 60 fits {'model__max_depth': 18, 'model__max_features': 13, 'model__n_estimators': 481} RandomForestClassifier() finished! Fitting 3 folds for each of 20 candidates, totalling 60 fits {'model__learning_rate': 0.05000000000000001, 'model__max_depth': 15, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 68, 'model__min_samples_split': 905, 'model__n_estimators': 362, 'model__subsample': 0.9444444444444444} GradientBoostingClassifier() finished!

2.5 測試結果

RF 模型得分：0.9229160811692528

GBDT 模型得分：0.9332932318964199

第二期排名，暫列第8

3. 致謝

感謝徐師兄一直以來的指點和幫助！
歡迎大家一起分享練習心得，一起繼續加油！

總結

以上是生活随笔為你收集整理的[Kesci] 预测分析 · 客户购买预测（AUC评估要使用predict_proba）的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇： LeetCode MySQL 1412.
下一篇： LeetCode 431. 将 N 叉树