The difference between LBO validation and LBO_full validation
LBO_full validation means: one block is held out as a fixed validation set (in the code below, the last month of transactions), and every training run uses all of the remaining data. Several differently seeded runs are trained this way, and their predictions are averaged.
The code from [1] is as follows:

import datetime
import numpy as np
import lightgbm as lgb

NUMBER_OF_MODELS = 3  # number of differently seeded models to average

# assumes the notebook globals START_DATE, TARGET, RESULTS and seed_everything() are defined

def make_predictions2(train_df, test_df, features_columns, target, lgb_params, NFOLDS=2):
    # (the original kernel reads the global TARGET rather than the `target` argument)
    SEED = 42
    P = test_df[features_columns]  # test set
    print('#'*20)
    print('LBO full set training...')

    ## Divide the train set into time blocks:
    ## convert TransactionDT to months and use
    ## the last month as validation to find the best round
    train_df['DT_M'] = train_df['TransactionDT'].apply(lambda x: (START_DATE + datetime.timedelta(seconds=x)))
    train_df['DT_M'] = (train_df['DT_M'].dt.year - 2017)*12 + train_df['DT_M'].dt.month

    main_train_set = train_df[train_df['DT_M'] < train_df['DT_M'].max()].reset_index(drop=True)   # earlier months as the training set
    validation_set = train_df[train_df['DT_M'] == train_df['DT_M'].max()].reset_index(drop=True)  # last month as the validation set

    # separate raw features and labels for the training set
    X, y = main_train_set[features_columns], main_train_set[TARGET]
    # separate raw features and labels for the validation set
    v_X, v_y = validation_set[features_columns], validation_set[TARGET]

    # ---- validation pass: find the best round via current_iteration() ----
    estimators_bestround = []
    for current_model in range(NUMBER_OF_MODELS):
        print('Model:', current_model + 1)
        SEED += 1
        seed_everything(SEED)
        corrected_lgb_params = lgb_params.copy()
        corrected_lgb_params['seed'] = SEED
        train_data = lgb.Dataset(X, label=y)
        valid_data = lgb.Dataset(v_X, label=v_y)
        estimator = lgb.train(corrected_lgb_params,
                              train_data,
                              valid_sets=[train_data, valid_data],
                              verbose_eval=1000,
                              )
        estimators_bestround.append(estimator.current_iteration())

    # ---- retrain on the full data with the mean best round, then predict on the test set ----
    corrected_lgb_params = lgb_params.copy()
    corrected_lgb_params['n_estimators'] = int(np.mean(estimators_bestround))
    corrected_lgb_params['early_stopping_rounds'] = None
    print('#'*10)
    print('Mean Best round:', corrected_lgb_params['n_estimators'])

    # all training data
    X, y = train_df[features_columns], train_df[TARGET]
    # test data
    P = test_df[features_columns]

    RESULTS['lbo_full'] = 0
    for current_model in range(NUMBER_OF_MODELS):
        print('Model:', current_model + 1)
        SEED += 1
        seed_everything(SEED)
        train_data = lgb.Dataset(X, label=y)
        estimator = lgb.train(corrected_lgb_params, train_data)
        RESULTS['lbo_full'] += estimator.predict(P) / NUMBER_OF_MODELS
    return RESULTS
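For reference, a hypothetical call might look like the sketch below. The parameter values are placeholders rather than the settings used in [1]; the one real requirement is that lgb_params enables early stopping, since the validation pass depends on it and the final retrain then switches it off:

# hypothetical parameters, for illustration only -- not the settings from [1]
lgb_params = {
    'objective': 'binary',           # fraud / not fraud
    'metric': 'auc',
    'learning_rate': 0.01,
    'n_estimators': 10000,           # generous upper bound; early stopping picks the real round
    'early_stopping_rounds': 100,    # the validation pass above relies on this being set
    'verbose': -1,
}

# RESULTS is the notebook's global results container (it holds the holdout
# labels in RESULTS[TARGET]); the function also updates it in place.
RESULTS = make_predictions2(train_df, test_df, features_columns, TARGET, lgb_params)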
LBO validation means: as above, one block (the last month) is held out as the fixed validation set, but the remaining data is additionally split by K-fold cross-validation (N_SPLITS=3). Each fold's model is early-stopped against the same fixed validation set, and the best rounds are averaged; the difference from LBO_full is therefore that each estimate of the best round is made on a fold of the remaining data rather than on all of it. The code from [1] is as follows:
from sklearn.model_selection import KFold
from sklearn import metrics

print('#'*20)
print('LBO training...')

## Divide the train set into time blocks:
## convert TransactionDT to months and use the last month as validation
train_df['DT_M'] = train_df['TransactionDT'].apply(lambda x: (START_DATE + datetime.timedelta(seconds=x)))
train_df['DT_M'] = (train_df['DT_M'].dt.year - 2017)*12 + train_df['DT_M'].dt.month

main_train_set = train_df[train_df['DT_M'] < train_df['DT_M'].max()].reset_index(drop=True)
validation_set = train_df[train_df['DT_M'] == train_df['DT_M'].max()].reset_index(drop=True)

## We will use oof kfold to find the "best round"
folds = KFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)

# Main Data
X, y = main_train_set[features_columns], main_train_set[TARGET]
# Validation Data
v_X, v_y = validation_set[features_columns], validation_set[TARGET]

estimators_bestround = []
for fold_, (trn_idx, val_idx) in enumerate(folds.split(X, y)):
    print('Fold:', fold_ + 1)
    tr_x, tr_y = X.iloc[trn_idx, :], y[trn_idx]
    train_data = lgb.Dataset(tr_x, label=tr_y)
    valid_data = lgb.Dataset(v_X, label=v_y)
    estimator = lgb.train(lgb_params,
                          train_data,
                          valid_sets=[train_data, valid_data],
                          verbose_eval=1000,
                          )
    estimators_bestround.append(estimator.current_iteration())

## Now we have the mean "best round" and can train on the full set
corrected_lgb_params = lgb_params.copy()
corrected_lgb_params['n_estimators'] = int(np.mean(estimators_bestround))
corrected_lgb_params['early_stopping_rounds'] = None
print('#'*10)
print('Mean Best round:', corrected_lgb_params['n_estimators'])

# Main Data
X, y = train_df[features_columns], train_df[TARGET]
# Test Data
P = test_df[features_columns]

RESULTS['lbo'] = 0
for fold_, (trn_idx, val_idx) in enumerate(folds.split(X, y)):
    print('Fold:', fold_ + 1)
    tr_x, tr_y = X.iloc[trn_idx, :], y[trn_idx]
    train_data = lgb.Dataset(tr_x, label=tr_y)
    estimator = lgb.train(corrected_lgb_params, train_data)
    RESULTS['lbo'] += estimator.predict(P) / N_SPLITS  # P is the local holdout, whose labels are known

print('AUC score', metrics.roc_auc_score(RESULTS[TARGET], RESULTS['lbo']))
print('#'*20)
Note: according to [2], do not include the validation set in the feature-engineering stage, because doing so leaks information from the held-out month into training. A minimal sketch of the leakage-safe pattern follows.
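To make the warning concrete, here is a small sketch assuming a frequency-encoded feature; the column card1 and the choice of frequency encoding are illustrative assumptions, not code from [1]:

# Hypothetical illustration of the warning from [2]: 'card1' and frequency
# encoding are example choices, not taken from the original kernel.
# Fit the encoding on the training months only ...
freq = main_train_set['card1'].value_counts()

# ... then apply the same mapping to validation and test without refitting.
main_train_set['card1_freq'] = main_train_set['card1'].map(freq)
validation_set['card1_freq'] = validation_set['card1'].map(freq)
test_df['card1_freq'] = test_df['card1'].map(freq)

# The leaky version would compute value_counts() on the concatenation of the
# training and validation months, letting the held-out month shape the
# feature values that the models are trained on.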
References:
[1] https://www.kaggle.com/kyakovlev/ieee-cv-options
[2] https://www.kaggle.com/c/ieee-fraud-detection/discussion/107728#latest-627879