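[Data Competitions] Competition Handbook Tricks: A High-End Ensembling Strategy Built on Open-Source Results
Authors: 塵沙杰少, 櫻落
Competition Handbook Tricks: Ensembling Built on Open-Source Results
(An easy way to pick up a silver medal)
Background
The idea in this article is simple: you do not need to train any model yourself. Apply a two-step "direct optimization" to the submission files that others have already published, and you can obtain a result that beats every one of those public submissions. Some lazy Kaggle competitors have used exactly this strategy in the final three days of a competition to take a silver medal.
Two-step model ensembling
1. Basic ensembling
Collect all submission files shared by the open-source community (say there are N of them);
Sort all public results by score, from the lowest-scoring (weakest) to the highest-scoring;
Take the first M, i.e. the weaker results, and combine them with some ensembling method into one blended result, so the pool now consists of that blended result plus the remaining N - M results;
Then blend that result with the public result whose score is closest to it, and repeat until every result has been blended in.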
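The steps above can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: the function name `progressive_blend`, the `target` column, the defaults for `m` and `coeff`, the plain average used for the first step, and the convention that a lower score is better (an error metric) are all assumptions made for the example. It also assumes each incoming submission scores better than the running blend, which in practice can only be confirmed by submitting.

```python
import pandas as pd


def progressive_blend(subs, scores, m=3, coeff=0.6, target="target"):
    """Blend public submissions step by step (illustrative sketch).

    subs   : list of DataFrames with identical row order and a `target` column
    scores : leaderboard score of each submission (assumed: lower is better)
    m      : how many of the weakest submissions to average in the first step
    coeff  : weight given to the incoming, better-scoring submission
    """
    # Order from weakest to strongest (highest error first).
    order = sorted(range(len(subs)), key=lambda i: scores[i], reverse=True)
    ranked = [subs[i] for i in order]

    # Step 1: plain average of the m weakest submissions.
    blended = ranked[0].copy()
    blended[target] = sum(df[target] for df in ranked[:m]) / m

    # Step 2: fold in the remaining submissions one at a time,
    # always moving on to the next-closest (next-better) score.
    for df in ranked[m:]:
        blended[target] = coeff * df[target] + (1 - coeff) * blended[target]
    return blended
```

A weight above 0.5 for the incoming submission reflects the rule of thumb that the better-scoring file should dominate each pairwise blend.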
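2. Upgraded basic ensembling
Take the result of the basic ensembling and then correct it row by row. (See the case study below for details.)
If a prediction is larger than the public results again and again, multiply it by a coefficient greater than 1; if it is smaller again and again, multiply it by a coefficient less than 1.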
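The correction rule can be written compactly with NumPy. The sketch below is one possible vectorized reading of the rule, not the notebook's implementation: the function name `majority_correct` and the array layout are assumptions, while the default values of `majority`, `pcoeff` and `mcoeff` are the ones used in the case study.

```python
import numpy as np


def majority_correct(main, others, majority=9, pcoeff=1.0016, mcoeff=0.9984):
    """Nudge rows where the blend sits on one side of most public results.

    main   : array of shape (n_rows,), the blended predictions
    others : array of shape (n_models, n_rows), the public predictions
    """
    main = np.asarray(main, dtype=float)
    others = np.asarray(others, dtype=float)

    # For every row, count the public results that the blend exceeds.
    above = (main > others).sum(axis=0)
    # Everything else, ties included, counts as "not above".
    below = others.shape[0] - above

    out = main.copy()
    grow = above > majority
    shrink = (below > majority) & ~grow
    out[grow] *= pcoeff      # repeatedly larger: scale by a factor > 1
    out[shrink] *= mcoeff    # repeatedly smaller: scale by a factor < 1
    return out
```

With 11 public results and `majority=9`, a row is touched only when at least 10 of the 11 results fall on the same side of the blend.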
Case study
This case is excerpted from the Kaggle notebook "[results-driven] Tabular Playground Series - 201": https://www.kaggle.com/somayyehgholami/results-driven-tabular-playground-series-201
1. Collect the public submissions
    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline

    dfk = pd.DataFrame({
        'Kernel ID': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K'],
        'Score': [0.69864, 0.69846, 0.69836, 0.69824, 0.69813, 0.69795,
                  0.69782, 0.69749, 0.69747, 0.69735, 0.69731],
        'File Path': [
            '../input/tps-jan-2021-gbdts-baseline/submission.csv',
            '../input/pseudo-labelling/submission.csv',
            '../input/v4-baseline-lgb-no-tune/sub_0.6971.csv',
            '../input/tps21-optuna-lgb-fast-hyper-parameter-tunning/submission.csv',
            '../input/gbdts-baseline-prevision-io-for-free/submission.csv',
            '../input/v41-eda-gbdts/res41.csv',
            '../input/v3-ensemble-lgb-xgb-cat/submission.csv',
            '../input/tabular-playground/sub_gbm.csv',
            '../input/v48tabular-playground-series-xgboost-lightgbm/V48-0.69747.csv',
            '../input/xgboost-hyperparameter-tuning-using-optuna/submission.csv',
            '../input/tabular-playground-some-slightly-useful-features/sub_gbm.csv',
        ]
    })

    dfk

| Kernel ID | Score | File Path |
| A | 0.69864 | ../input/tps-jan-2021-gbdts-baseline/submissio... |
| B | 0.69846 | ../input/pseudo-labelling/submission.csv |
| C | 0.69836 | ../input/v4-baseline-lgb-no-tune/sub_0.6971.csv |
| D | 0.69824 | ../input/tps21-optuna-lgb-fast-hyper-parameter... |
| E | 0.69813 | ../input/gbdts-baseline-prevision-io-for-free/... |
| F | 0.69795 | ../input/v41-eda-gbdts/res41.csv |
| G | 0.69782 | ../input/v3-ensemble-lgb-xgb-cat/submission.csv |
| H | 0.69749 | ../input/tabular-playground/sub_gbm.csv |
| I | 0.69747 | ../input/v48tabular-playground-series-xgboost-... |
| J | 0.69735 | ../input/xgboost-hyperparameter-tuning-using-o... |
| K | 0.69731 | ../input/tabular-playground-some-slightly-usef... |
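Before blending, it helps to load every file listed in the table and confirm that they line up row for row. The helper below is a hypothetical sketch (the name `load_submissions` is not from the notebook); it assumes the first column of every submission is the row id and that all files should list the same ids in the same order.

```python
import pandas as pd


def load_submissions(table, path_col="File Path"):
    """Read every CSV listed in `table` and check that the id columns match."""
    frames = [pd.read_csv(path) for path in table[path_col]]

    # Use the first file's id column (assumed to be column 0) as the reference.
    ref_ids = frames[0].iloc[:, 0]
    for path, df in zip(table[path_col], frames):
        if not df.iloc[:, 0].equals(ref_ids):
            raise ValueError(f"{path} is not row-aligned with the first file")
    return frames
```

Called as `load_submissions(dfk)`, this would return the eleven submissions in the same order as the table, ready for element-wise blending.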
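2. Blending function
Blend as: (result with the better leaderboard score) * coeff + (result with the slightly worse leaderboard score) * (1 - coeff), where coeff is usually greater than 0.5.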
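A minimal sketch of such a blending function is shown here. The name `blend` and the default `coeff=0.6` are assumptions for illustration. Following the case-study code, it treats the first column as the row id and every remaining column as a prediction column.

```python
import pandas as pd


def blend(main, support, coeff=0.6):
    """Return main * coeff + support * (1 - coeff) for every prediction column.

    main    : submission with the better leaderboard score
    support : submission with the slightly worse leaderboard score
    coeff   : weight of `main`, usually greater than 0.5
    """
    # Both frames must list the same ids in the same order.
    if not main.iloc[:, 0].equals(support.iloc[:, 0]):
        raise ValueError("submissions are not aligned on the id column")

    out = main.copy()
    cols = main.columns[1:]  # skip the id column
    out[cols] = coeff * main[cols] + (1 - coeff) * support[cols]
    return out
```

For example, `blend(H, sub1, coeff=0.6)` would weight submission H at 60% and the earlier blend at 40%; the exact coefficient is a tuning choice.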
3. Initial blending
3.1 Blend 1: A-G -> best result 1
[ A: (Score: 0.69864), B: (Score: 0.69846), ... , F: (Score: 0.69795), G: (Score: 0.69782) ] >>> sub1: (Score: 0.69781)
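3.2 Blend 2: blend result 1 with the submission whose score is close to it
Note: the submission with the better leaderboard score is main; the next best is support.
[ H: (Score: 0.69749) , sub1: (Score: 0.69781) ] >>> sub2: (Score: improved)
3.3 Blend 3: blend result 2 with the submission whose score is close to it
[ I: (Score: 0.69747) , sub2: (Score: -----) ] >>> sub3: (Score: improved)
3.4 Blend 4: blend result 3 with the submission whose score is close to it
[ J: (Score: 0.69735) , sub3: (Score: ------) ] >>> sub4: (Score: improved)
3.5 Blend 5: blend result 4 with the submission whose score is close to it
[ K: (Score: 0.69731) , sub4: (Score: -------) ] >>> sub5: (Score: 0.69688)
4. Upgraded blending
Correct the predictions that came out too low, and correct the predictions that came out too high.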
sub5: (Score: 0.69688) >>> sub6: (Score: 0.69682)
We first compare the result of the previous step with the result of each kernel used. We look for rows where all kernels (or a majority of them) fall on the same side of our previous result, either above it or below it. We also know that the result of the previous step scores better than every kernel used. So we can guess that these rows were pushed the wrong way in the earlier steps, that is, they were mistakenly increased or decreased. We compensate for these possible errors to some extent by applying the coefficients "pcoeff" and "mcoeff" (only in these rows). The pictures in the original notebook illustrate the method well.
    main = sub5  # 0.69688
    comp = main.copy()

    majority = 9      # Hyper parameter
    pcoeff = 1.0016   # Hyper parameter
    mcoeff = 0.9984   # Hyper parameter

    # Collected for plotting: [main value, mean of the kernels, corrected value]
    pxy = [[], [], []]
    mxy = [[], [], []]

    for i in main.columns[1:]:
        lm = main[i].tolist()
        ls = [[] for _ in range(11)]
        res = []

        ## 1. Read all the public results
        for n in range(11):
            csv = pd.read_csv(dfk.iloc[n, 2])
            ls[n] = csv[i].tolist()

        ## 2. Compare main with every public result, row by row
        for j in range(len(main)):
            pcount = 0
            pvalue = 0.0
            mcount = 0
            mvalue = 0.0

            ## 2.1 Count how often main is greater than ls (pcount)
            ##     and how often main is not greater than ls (mcount)
            ## 2.2 If pcount exceeds the threshold, multiply main by a coefficient (usually > 1)
            ##     If mcount exceeds the threshold, multiply main by a coefficient (usually < 1)
            for k in range(11):
                if lm[j] > ls[k][j]:
                    pcount += 1
                    pvalue += ls[k][j]
                else:
                    mcount += 1
                    mvalue += ls[k][j]

            if pcount > majority:
                res.append(lm[j] * pcoeff)
                pxy[0].append(lm[j])
                pxy[1].append(pvalue / pcount)
                pxy[2].append(lm[j] * pcoeff)
            elif mcount > majority:
                res.append(lm[j] * mcoeff)
                mxy[0].append(lm[j])
                mxy[1].append(mvalue / mcount)
                mxy[2].append(lm[j] * mcoeff)
            else:
                res.append(lm[j])

        comp[i] = res

    sub6 = comp