

[Kesci] Predictive Analytics · Customer Purchase Prediction (AUC evaluation must use predict_proba)


Contents

    • 1. Baseline
    • 2. AUC evaluation must use predict_proba
      • 2.1 Import packages
      • 2.2 Feature extraction
      • 2.3 Training + model selection
      • 2.4 Grid/random search over parameters + submission
      • 2.5 Test results
    • 3. Acknowledgements

Beginner contest page

1. Baseline

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

train = pd.read_csv("./train_set.csv")
test = pd.read_csv("./test_set.csv")
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25317 entries, 0 to 25316
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   ID         25317 non-null  int64
 1   age        25317 non-null  int64
 2   job        25317 non-null  object
 3   marital    25317 non-null  object
 4   education  25317 non-null  object
 5   default    25317 non-null  object
 6   balance    25317 non-null  int64
 7   housing    25317 non-null  object
 8   loan       25317 non-null  object
 9   contact    25317 non-null  object
 10  day        25317 non-null  int64
 11  month      25317 non-null  object
 12  duration   25317 non-null  int64
 13  campaign   25317 non-null  int64
 14  pdays      25317 non-null  int64
 15  previous   25317 non-null  int64
 16  poutcome   25317 non-null  object
 17  y          25317 non-null  int64
dtypes: int64(9), object(9)
memory usage: 3.5+ MB

Field descriptions:

NO | Field     | Type   | Description
---|-----------|--------|------------------------------------------------------------
 1 | ID        | Int    | unique customer identifier
 2 | age       | Int    | customer age
 3 | job       | String | customer occupation
 4 | marital   | String | marital status
 5 | education | String | education level
 6 | default   | String | whether the customer has a record of default
 7 | balance   | Int    | average yearly account balance
 8 | housing   | String | whether the customer has a housing loan
 9 | loan      | String | whether the customer has a personal loan
10 | contact   | String | communication channel used to contact the customer
11 | day       | Int    | day of the month of the last contact
12 | month     | String | month of the last contact
13 | duration  | Int    | duration of the last contact
14 | campaign  | Int    | number of contacts with this customer during this campaign
15 | pdays     | Int    | days since the customer was last contacted in a previous campaign (999 = never contacted)
16 | previous  | Int    | number of contacts with this customer before this campaign
17 | poutcome  | String | outcome of the previous campaign
18 | y         | Int    | target: whether the customer subscribes to a term deposit
  • Correlation coefficients

abs(train.corr()['y']).sort_values(ascending=False)

y           1.000000
ID          0.556627
duration    0.394746
pdays       0.107565
previous    0.088337
campaign    0.075173
balance     0.057564
day         0.031886
age         0.029916
Name: y, dtype: float64

The 0.56 correlation between ID and y stands out: it suggests the rows are roughly ordered by label, so ID carries leaked information and is dropped from the features below.
  • Plot the distributions of the numeric features

s = (train.dtypes == 'object')
object_col = list(s[s].index)
object_col
num_col = list(set(train.columns) - set(object_col))

plt.figure(figsize=(25, 22))
for i, col in enumerate(num_col):
    plt.subplot(3, 3, i + 1)
    sns.distplot(train[col])  # pass kde=False to hide the density curve
    plt.xlabel(col, size=20)
plt.show()
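sns.distplot is deprecated in recent seaborn releases; here is a minimal sketch of the same loop with the replacement API, assuming seaborn >= 0.11:

# Sketch: same grid of distribution plots with the non-deprecated API
plt.figure(figsize=(25, 22))
for i, col in enumerate(num_col):
    plt.subplot(3, 3, i + 1)
    sns.histplot(train[col], kde=True)  # histogram plus density curve
    plt.xlabel(col, size=20)
plt.show()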

  • Check the proportion of positive labels (y = 1) in the training set

len(train[train['y'] == 1]) / len(train['y'])

0.11695698542481336

Only about 11.7% of customers buy, so the classes are imbalanced.

  • This is a beginner contest and the data has no missing values, so try a model directly.

X_train = train.drop(['ID', 'y'], axis=1)
X_test = test.drop(['ID'], axis=1)
y_train = train['y']

def num_cat_splitor(X_train):
    s = (X_train.dtypes == 'object')
    object_cols = list(s[s].index)
    num_cols = list(set(X_train.columns) - set(object_cols))
    return num_cols, object_cols

num_cols, object_cols = num_cat_splitor(X_train)

# inspect the categories of each text feature
for col in object_cols:
    print(col, sorted(train[col].unique()))
    print(col, sorted(test[col].unique()))

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_cols)),
    # ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(object_cols)),
    # note: in sklearn >= 1.2 the argument is sparse_output=False
    ('cat_encoder', OneHotEncoder(sparse=False, handle_unknown='ignore')),
])
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
X_prepared = full_pipeline.fit_transform(X_train)

from sklearn.ensemble import RandomForestClassifier

prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('forst_reg', RandomForestClassifier(random_state=0))
])
param_grid = [{
    'forst_reg__n_estimators': [50, 100, 150, 200, 250, 300, 330, 350],
    'forst_reg__max_features': [45, 50, 55, 65]
}]

grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid,
                                cv=7, scoring='roc_auc', verbose=2, n_jobs=-1)
grid_search_prep.fit(X_train, y_train)
grid_search_prep.best_params_
final_model = grid_search_prep.best_estimator_

y_pred_test = final_model.predict(X_test)  # WRONG: the AUC metric needs predict_proba, see section 2!

result = pd.DataFrame()
result['ID'] = test['ID']
result['pred'] = y_pred_test
result.to_csv('buy_product_pred.csv', index=False)
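The fix, developed properly in section 2, is a one-line change: submit the positive-class probability instead of the hard label.

y_pred_test = final_model.predict_proba(X_test)[:, 1]  # P(y = 1), the score AUC actually ranks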

Leaderboard result:

AUC score: 0.72439844

2. AUC evaluation must use predict_proba

The AUC metric evaluates how well the model ranks positives above negatives, so predictions must be submitted as probabilities (or decision scores). The baseline above submitted hard 0/1 labels from predict, which is wrong.

  • ROC curves and AUC as classifier performance metrics: https://www.cnblogs.com/dlml/p/4403482.html
  • How to understand AUC in machine learning and statistics: https://www.zhihu.com/question/39840928
  • The sklearn.metrics.roc_auc_score documentation

A key strength of AUC as an evaluation metric: it remains informative even when the positive and negative classes are heavily imbalanced.
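To make the difference concrete, here is a minimal, self-contained sketch on synthetic data (not the contest data): scoring hard labels collapses the ranking that AUC measures, while scoring probabilities preserves it.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# imbalanced toy problem, roughly 10% positives like the contest data
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

auc_from_labels = roc_auc_score(y_va, clf.predict(X_va))             # hard 0/1 labels
auc_from_probs = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])  # P(y = 1)
print(auc_from_labels, auc_from_probs)  # the probability version is typically clearly higher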

The code is reworked and optimized below.

2.1 Import packages

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.facecolor'] = (1, 1, 1, 1)  # white figure background in PyCharm so the axes stay readable
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

train = pd.read_csv("./train_set.csv")
test = pd.read_csv("./test_set.csv")

2.2 Feature extraction

  • Inspect the values of the text features

# object_col is not defined yet in this fresh session, so rebuild it first
s = (train.dtypes == 'object')
object_col = list(s[s].index)

for col in object_col:
    print(col, sorted(train[col].unique()))
    print(col, sorted(test[col].unique()))

job ['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown']
job ['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown']
marital ['divorced', 'married', 'single']
marital ['divorced', 'married', 'single']
education ['primary', 'secondary', 'tertiary', 'unknown']
education ['primary', 'secondary', 'tertiary', 'unknown']
default ['no', 'yes']
default ['no', 'yes']
housing ['no', 'yes']
housing ['no', 'yes']
loan ['no', 'yes']
loan ['no', 'yes']
contact ['cellular', 'telephone', 'unknown']
contact ['cellular', 'telephone', 'unknown']
month ['apr', 'aug', 'dec', 'feb', 'jan', 'jul', 'jun', 'mar', 'may', 'nov', 'oct', 'sep']
month ['apr', 'aug', 'dec', 'feb', 'jan', 'jul', 'jun', 'mar', 'may', 'nov', 'oct', 'sep']
poutcome ['failure', 'other', 'success', 'unknown']
poutcome ['failure', 'other', 'success', 'unknown']

Train and test share identical category sets for every column.
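Rather than eyeballing the printout, the same check can be automated; a small sketch (not in the original):

# Sketch: assert that train and test share the same category sets,
# so one-hot encoding produces aligned columns for both.
for col in object_col:
    assert set(train[col].unique()) == set(test[col].unique()), col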
  • Convert the binary features to 0/1

# convert the three binary (yes/no) columns 'default', 'housing', 'loan' to 0/1
def binaryFeature(data):
    data['default_'] = 0
    data.loc[data['default'] == 'yes', 'default_'] = 1  # .loc avoids pandas chained-assignment warnings
    data['housing_'] = 0
    data.loc[data['housing'] == 'yes', 'housing_'] = 1
    data['loan_'] = 0
    data.loc[data['loan'] == 'yes', 'loan_'] = 1
    return data.drop(['default', 'housing', 'loan'], axis=1)

X_train = binaryFeature(train)
X_test = binaryFeature(test)
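An equivalent, arguably more idiomatic version maps the strings directly; a sketch producing the same three columns:

# Sketch: map yes/no straight to 1/0 instead of masked assignment
def binary_feature_map(data):
    for col in ['default', 'housing', 'loan']:
        data[col + '_'] = data[col].map({'yes': 1, 'no': 0})
    return data.drop(['default', 'housing', 'loan'], axis=1)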
  • Split the training data for local testing

X_train = X_train.drop(['ID'], axis=1)
X_test = X_test.drop(['ID'], axis=1)

# split part of the training set off for validation, with stratified sampling
from sklearn.model_selection import StratifiedShuffleSplit
splt = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
for train_idx, valid_idx in splt.split(X_train, X_train['y']):
    train_part = X_train.loc[train_idx]
    valid_part = X_train.loc[valid_idx]

# the training set is now split in two for local testing
train_part_y = train_part['y']
valid_part_y = valid_part['y']
train_part = train_part.drop(['y'], axis=1)
valid_part = valid_part.drop(['y'], axis=1)
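Since only a single split is needed here, train_test_split with stratify is a one-line equivalent; a sketch (same idea, up to the random draw):

# Sketch: stratified single split in one call, preserving the ~11.7% positive rate
from sklearn.model_selection import train_test_split
train_part, valid_part, train_part_y, valid_part_y = train_test_split(
    X_train.drop(['y'], axis=1), X_train['y'],
    test_size=0.2, stratify=X_train['y'], random_state=1)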
  • Feature-processing pipeline

def num_cat_splitor(X_train):
    s = (X_train.dtypes == 'object')
    object_cols = list(s[s].index)
    num_cols = list(set(X_train.columns) - set(object_cols))
    return num_cols, object_cols

num_cols, object_cols = num_cat_splitor(X_train)
num_cols.remove('y')

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_cols)),
    # ('imputer', SimpleImputer(strategy="median")),
    # ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(object_cols)),
    ('cat_encoder', OneHotEncoder(sparse=False, handle_unknown='ignore')),
])
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
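In current scikit-learn, ColumnTransformer subsumes the DataFrameSelector + FeatureUnion pattern; a sketch of the equivalent, assuming sklearn >= 0.20:

# Sketch: the same preprocessing with ColumnTransformer
from sklearn.compose import ColumnTransformer
full_pipeline_ct = ColumnTransformer([
    ("num", "passthrough", num_cols),  # swap in StandardScaler() if scaling is wanted
    ("cat", OneHotEncoder(handle_unknown='ignore'), object_cols),
])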

2.3 Training + model selection

# local testing, model selection
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rf = RandomForestClassifier()
knn = KNeighborsClassifier()
lr = LogisticRegression()
svc = SVC(probability=True)
gbdt = GradientBoostingClassifier()

models = [knn, lr, svc, rf, gbdt]
param_grid_list = [
    # knn
    [{'model__n_neighbors': [5, 15, 35, 50, 100],
      'model__leaf_size': [10, 20, 30, 40, 50]}],
    # lr (note: penalty='l1' needs solver='liblinear' or 'saga';
    # with the default lbfgs those candidates fail to fit and score nan)
    [{'model__penalty': ['l1', 'l2'],
      'model__C': [0.2, 0.5, 1, 1.2, 1.5],
      'model__max_iter': [10000]}],
    # svc
    [{'model__C': [0.2, 0.5, 1, 1.2],
      'model__kernel': ['rbf']}],
    # rf
    [{  # 'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
      'model__n_estimators': [200, 250, 300, 330, 350],
      'model__max_features': [20, 30, 40, 50],
      'model__max_depth': [5, 7]}],
    # gbdt
    [{'model__learning_rate': [0.1, 0.5],
      'model__n_estimators': [130, 200, 300],
      'model__max_features': ['sqrt'],
      'model__max_depth': [5, 7],
      'model__min_samples_split': [500, 1000, 1200],
      'model__min_samples_leaf': [60, 100],
      'model__subsample': [0.8, 1]}],
]

for i, model in enumerate(models):
    pipe = Pipeline([
        ('preparation', full_pipeline),
        ('model', model)
    ])
    grid_search = GridSearchCV(pipe, param_grid_list[i], cv=3,
                               scoring='roc_auc', verbose=2, n_jobs=-1)
    grid_search.fit(train_part, train_part_y)
    print(grid_search.best_params_)
    final_model = grid_search.best_estimator_
    pred = final_model.predict_proba(valid_part)[:, 1]  # AUC must be computed from probability predictions
    print("auc score: ", roc_auc_score(valid_part_y, pred))
  • Note: for the roc_auc metric, score with predict_proba!
Fitting 3 folds for each of 25 candidates, totalling 75 fits
{'model__leaf_size': 20, 'model__n_neighbors': 50}
auc score:  0.8212256518034133
Fitting 3 folds for each of 10 candidates, totalling 30 fits
{'model__C': 1.2, 'model__max_iter': 10000, 'model__penalty': 'l2'}
auc score:  0.9011510812019533
Fitting 3 folds for each of 4 candidates, totalling 12 fits
{'model__C': 0.2, 'model__kernel': 'rbf'}
auc score:  0.7192431208601267
Fitting 3 folds for each of 40 candidates, totalling 120 fits
{'model__max_depth': 7, 'model__max_features': 20, 'model__n_estimators': 350}
auc score:  0.913398647137746
Fitting 3 folds for each of 144 candidates, totalling 432 fits
{'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 60, 'model__min_samples_split': 500, 'model__n_estimators': 300, 'model__subsample': 1}
auc score:  0.9299485084368806

GBDT (gradient-boosted decision trees) clearly performs best.

2.4 Grid/random search over parameters + submission

Refine the parameter lists, train on the full training data, and use the RF and GBDT models to predict the test set.

  • Grid search
# full-data training, grid search, submission
y_train = X_train['y']
X_train_ = X_train.drop(['y'], axis=1)

select_model = [rf, gbdt]
param_grid_list = [
    # rf
    [{  # 'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
      'model__n_estimators': [250, 300, 350, 400],
      'model__max_features': [7, 8, 10, 15, 20],
      'model__max_depth': [7, 9, 10, 11]}],
    # gbdt (note: sklearn requires subsample in (0, 1]; the 1.2 candidates fail to fit and score nan)
    [{'model__learning_rate': [0.03, 0.05, 0.1],
      'model__n_estimators': [200, 300, 350],
      'model__max_features': ['sqrt'],
      'model__max_depth': [7, 9, 11],
      'model__min_samples_split': [300, 400, 500],
      'model__min_samples_leaf': [50, 60, 70],
      'model__subsample': [0.8, 1, 1.2]}],
]

for i, model in enumerate(select_model):
    pipe = Pipeline([
        ('preparation', full_pipeline),
        ('model', model)
    ])
    grid_search = GridSearchCV(pipe, param_grid_list[i], cv=3,
                               scoring='roc_auc', verbose=2, n_jobs=-1)
    grid_search.fit(X_train_, y_train)
    print(grid_search.best_params_)
    final_model = grid_search.best_estimator_
    pred = final_model.predict_proba(X_test)[:, 1]  # AUC must use probability predictions
    print(model, '\n finished!')
    result = pd.DataFrame()
    result['ID'] = test['ID']
    result['pred'] = pred
    result.to_csv('{}_pred.csv'.format(i), index=False)

Fitting 3 folds for each of 80 candidates, totalling 240 fits
{'model__max_depth': 11, 'model__max_features': 15, 'model__n_estimators': 400}
RandomForestClassifier()
 finished!
Fitting 3 folds for each of 729 candidates, totalling 2187 fits
{'model__learning_rate': 0.05, 'model__max_depth': 11, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 50, 'model__min_samples_split': 500, 'model__n_estimators': 300, 'model__subsample': 1}
GradientBoostingClassifier()
 finished!
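Before submitting, the cross-validated AUC of the winning candidate is worth a look; GridSearchCV already stores it, as in this one-line sketch:

print(grid_search.best_score_)  # mean cross-validated roc_auc of the best parameter set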
  • Random search
# randomized parameter search
y_train = X_train['y']
X_train_ = X_train.drop(['y'], axis=1)

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

select_model = [rf, gbdt]
param_distribs = [
    # rf
    [{  # 'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
      'model__n_estimators': randint(low=250, high=500),
      'model__max_features': randint(low=10, high=30),
      'model__max_depth': randint(low=8, high=20)}],
    # gbdt (note: subsample values above 1.0 are invalid in sklearn; those draws fail and score nan)
    [{'model__learning_rate': np.linspace(0.01, 0.1, 10),
      'model__n_estimators': randint(low=250, high=500),
      'model__max_features': ['sqrt'],
      'model__max_depth': randint(low=8, high=20),
      'model__min_samples_split': randint(low=400, high=1000),
      'model__min_samples_leaf': randint(low=40, high=80),
      'model__subsample': np.linspace(0.5, 1.5, 10)}],
]

for i, model in enumerate(select_model):
    pipe = Pipeline([
        ('preparation', full_pipeline),
        ('model', model)
    ])
    rand_search = RandomizedSearchCV(pipe, param_distributions=param_distribs[i], cv=3,
                                     n_iter=20, scoring='roc_auc', verbose=2, n_jobs=-1)
    rand_search.fit(X_train_, y_train)
    print(rand_search.best_params_)
    final_model = rand_search.best_estimator_
    pred = final_model.predict_proba(X_test)[:, 1]  # AUC must use probability predictions
    print(model, '\n finished!')
    result = pd.DataFrame()
    result['ID'] = test['ID']
    result['pred'] = pred
    result.to_csv('{}_pred.csv'.format(i), index=False)

Fitting 3 folds for each of 20 candidates, totalling 60 fits
{'model__max_depth': 18, 'model__max_features': 13, 'model__n_estimators': 481}
RandomForestClassifier()
 finished!
Fitting 3 folds for each of 20 candidates, totalling 60 fits
{'model__learning_rate': 0.05000000000000001, 'model__max_depth': 15, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 68, 'model__min_samples_split': 905, 'model__n_estimators': 362, 'model__subsample': 0.9444444444444444}
GradientBoostingClassifier()
 finished!

2.5 Test results

RF model score: 0.9229160811692528


GBDT model score: 0.9332932318964199


Second-round leaderboard: currently ranked 8th.

3. Acknowledgements

Thanks to senior Xu for his ongoing guidance and help!
Everyone is welcome to share their practice notes; let's keep at it together!
