
Adversarial Validation: A Walkthrough of a Kernel from the Microsoft Malware Prediction Competition

Published: 2025/3/15

English documentation link
Competition page link

What adversarial validation is for


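Adversarial validation builds a validation set whose distribution matches that of the data to be classified (the test set). A model tuned against such a validation set tends to classify the test data better.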
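
One common way to build such a validation set is to score every training row with the adversarial classifier's predicted probability of being a test row, then hold out the rows that look most like test data. The sketch below illustrates only that selection step. It is an assumption for illustration, not something the kernel analysed here does; the function name `split_by_test_likeness` and the probabilities are made up.

```python
def split_by_test_likeness(test_probs, val_fraction=0.2):
    """Split row indices so the most test-like rows become the validation set.

    test_probs[i] is a (hypothetical) predicted probability that training
    row i comes from the test set.
    """
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    n_val = max(1, int(len(test_probs) * val_fraction))
    # Sort row indices by predicted probability, highest first.
    order = sorted(range(len(test_probs)),
                   key=lambda i: test_probs[i], reverse=True)
    val_idx = sorted(order[:n_val])
    train_idx = sorted(order[n_val:])
    return train_idx, val_idx


# Made-up probabilities for ten training rows.
probs = [0.05, 0.91, 0.40, 0.77, 0.12, 0.63, 0.08, 0.95, 0.30, 0.55]
train_idx, val_idx = split_by_test_likeness(probs, val_fraction=0.2)
print(val_idx)  # [1, 7]
```

The two rows with the highest probabilities (indices 7 and 1) are held out, and the remaining eight stay in the training split.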

A brief introduction to AUC


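The higher the AUC of the final model's predictions on new data, the better the model separates the two classes: 0.5 is no better than random guessing and 1.0 is a perfect ranking. In adversarial validation the two classes are "training row" and "test row", so an AUC near 0.5 means the two sets are hard to tell apart, while an AUC near 1 means their distributions differ.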
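
To make the metric concrete, AUC can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The following is a minimal pure-Python sketch of that definition; the helper name `auc_score` and the toy data are illustrative, and for real data a library routine such as scikit-learn's `roc_auc_score` is the usual choice.

```python
def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    This pairwise version is O(n_pos * n_neg), fine for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, scores))  # 0.75
```

Here three of the four positive/negative pairs are ordered correctly (only 0.35 versus 0.4 is not), giving 0.75.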

Project walkthrough


Code:

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn import model_selection, preprocessing, metrics
import warnings
import datetime
import os
import gc

warnings.filterwarnings("ignore")

# List the input directories available to the kernel.
print(os.listdir("../input"))
print(os.listdir("../input/microsoft-malware-prediction"))
print(os.listdir("../input/malware-feature-engineering-full-train-and-test/"))

Output:
['microsoft-malware-prediction', 'malware-feature-engineering-full-train-and-test']
['train.csv', 'sample_submission.csv', 'test.csv']
['__output__.json', 'custom.css', 'new_test.csv', '__results__.html', 'new_train.csv']

columns_to_use = [
    'ProductName', 'EngineVersion', 'AppVersion', 'AvSigVersion', 'IsBeta',
    'RtpStateBitfield', 'IsSxsPassiveMode', 'DefaultBrowsersIdentifier',
    'AVProductStatesIdentifier', 'AVProductsInstalled', 'AVProductsEnabled',
    'HasTpm', 'CountryIdentifier', 'CityIdentifier',
    'OrganizationIdentifier', 'GeoNameIdentifier',
    'LocaleEnglishNameIdentifier', 'Platform', 'Processor', 'OsVer',
    'OsBuild', 'OsSuite', 'OsPlatformSubRelease', 'OsBuildLab',
    'SkuEdition', 'IsProtected', 'AutoSampleOptIn', 'SMode',
    'IeVerIdentifier', 'SmartScreen', 'Firewall', 'UacLuaenable',
    'Census_MDC2FormFactor', 'Census_DeviceFamily',
    'Census_OEMNameIdentifier', 'Census_OEMModelIdentifier',
    'Census_ProcessorCoreCount', 'Census_ProcessorManufacturerIdentifier',
    'Census_ProcessorModelIdentifier', 'Census_ProcessorClass',
    'Census_PrimaryDiskTotalCapacity', 'Census_PrimaryDiskTypeName',
    'Census_SystemVolumeTotalCapacity', 'Census_HasOpticalDiskDrive',
    'Census_TotalPhysicalRAM', 'Census_ChassisTypeName',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches',
    'Census_InternalPrimaryDisplayResolutionHorizontal',
    'Census_InternalPrimaryDisplayResolutionVertical',
    'Census_PowerPlatformRoleName', 'Census_InternalBatteryType',
    'Census_InternalBatteryNumberOfCharges', 'Census_OSVersion',
    'Census_OSArchitecture', 'Census_OSBranch', 'Census_OSBuildNumber',
    'Census_OSBuildRevision', 'Census_OSEdition', 'Census_OSSkuName',
    'Census_OSInstallTypeName', 'Census_OSInstallLanguageIdentifier',
    'Census_OSUILocaleIdentifier', 'Census_OSWUAutoUpdateOptionsName',
    'Census_IsPortableOperatingSystem', 'Census_GenuineStateName',
    'Census_ActivationChannel', 'Census_IsFlightingInternal',
    'Census_IsFlightsDisabled', 'Census_FlightRing',
    'Census_ThresholdOptIn', 'Census_FirmwareManufacturerIdentifier',
    'Census_FirmwareVersionIdentifier', 'Census_IsSecureBootEnabled',
    'Census_IsWIMBootEnabled', 'Census_IsVirtualDevice',
    'Census_IsTouchEnabled', 'Census_IsPenCapable',
    'Census_IsAlwaysOnAlwaysConnectedCapable', 'Wdft_IsGamer',
    'Wdft_RegionIdentifier',
]

new_train = pd.read_csv(
    '../input/malware-feature-engineering-full-train-and-test/new_train.csv',
    nrows=1000000, usecols=columns_to_use)
print(new_train.shape)
print(new_train.head())

Output:
(1000000, 80)

ProductName EngineVersion AppVersion AvSigVersion IsBeta RtpStateBitfield IsSxsPassiveMode DefaultBrowsersIdentifier AVProductStatesIdentifier AVProductsInstalled AVProductsEnabled HasTpm CountryIdentifier CityIdentifier OrganizationIdentifier GeoNameIdentifier LocaleEnglishNameIdentifier Platform Processor OsVer OsBuild OsSuite OsPlatformSubRelease OsBuildLab SkuEdition IsProtected AutoSampleOptIn SMode IeVerIdentifier SmartScreen Firewall UacLuaenable Census_MDC2FormFactor Census_DeviceFamily Census_OEMNameIdentifier Census_OEMModelIdentifier Census_ProcessorCoreCount Census_ProcessorManufacturerIdentifier Census_ProcessorModelIdentifier Census_ProcessorClass Census_PrimaryDiskTotalCapacity Census_PrimaryDiskTypeName Census_SystemVolumeTotalCapacity Census_HasOpticalDiskDrive Census_TotalPhysicalRAM Census_ChassisTypeName Census_InternalPrimaryDiagonalDisplaySizeInInches Census_InternalPrimaryDisplayResolutionHorizontal Census_InternalPrimaryDisplayResolutionVertical Census_PowerPlatformRoleName Census_InternalBatteryType Census_InternalBatteryNumberOfCharges Census_OSVersion Census_OSArchitecture Census_OSBranch Census_OSBuildNumber Census_OSBuildRevision Census_OSEdition Census_OSSkuName Census_OSInstallTypeName Census_OSInstallLanguageIdentifier Census_OSUILocaleIdentifier Census_OSWUAutoUpdateOptionsName Census_IsPortableOperatingSystem Census_GenuineStateName Census_ActivationChannel Census_IsFlightingInternal Census_IsFlightsDisabled Census_FlightRing Census_ThresholdOptIn Census_FirmwareManufacturerIdentifier Census_FirmwareVersionIdentifier Census_IsSecureBootEnabled Census_IsWIMBootEnabled Census_IsVirtualDevice Census_IsTouchEnabled Census_IsPenCapable Census_IsAlwaysOnAlwaysConnectedCapable Wdft_IsGamer Wdft_RegionIdentifier
0 0 0 0 0 0 0 0 -1 0 0 0 1 0 202.0 0 0 0 0 0 0 0 0 0 0 0 1.0 0 0.0 0 -1 1.0 0 0 0 0 20832.0 4.0 0 0 -1 476940.0 0 299451.0 0 4096.0 0 18.9 1440.0 900.0 0 -1 4.294967e+09 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN 0.0 0 NaN 0 2516.0 0 NaN 0.0 0 0 0.0 0.0 0
1 0 1 1 1 0 0 0 -1 0 0 0 1 1 164.0 0 1 1 0 0 0 0 0 0 0 0 1.0 0 0.0 0 -1 1.0 0 1 0 0 98328.0 4.0 0 1 -1 476940.0 0 102385.0 0 4096.0 1 13.9 1366.0 768.0 1 -1 1.000000e+00 1 0 0 0 1 0 0 1 1 1 0 0 1 0 NaN 0.0 1 NaN 0 1767.0 0 NaN 0.0 0 0 0.0 0.0 1
2 0 0 0 2 0 0 0 -1 0 0 0 1 2 685.0 0 2 2 0 0 0 0 1 0 0 1 1.0 0 0.0 0 0 1.0 0 0 0 1 2.0 4.0 0 2 -1 114473.0 1 113907.0 0 4096.0 0 21.5 1920.0 1080.0 0 -1 4.294967e+09 0 0 0 0 0 1 1 0 2 2 1 0 0 1 NaN 0.0 0 NaN 1 190.0 0 NaN 0.0 0 0 0.0 0.0 2
3 0 0 0 3 0 0 0 -1 0 0 0 1 3 20.0 -1 3 3 0 0 0 0 0 0 0 0 1.0 0 0.0 0 1 1.0 0 0 0 2 171.0 4.0 0 3 -1 238475.0 2 227116.0 0 4096.0 2 18.5 1366.0 768.0 0 -1 4.294967e+09 2 0 0 0 2 0 0 0 3 3 1 0 0 1 NaN 0.0 0 NaN 2 33.0 0 NaN 0.0 0 0 0.0 0.0 2
4 0 0 0 4 0 0 0 -1 0 0 0 1 4 15.0 -1 4 4 0 0 0 0 1 0 0 1 1.0 0 0.0 0 0 1.0 0 1 0 2 2263.0 4.0 0 4 -1 476940.0 0 101900.0 0 6144.0 3 14.0 1366.0 768.0 1 0 0.000000e+00 3 0 0 0 3 1 1 2 1 1 1 0 0 0 0.0 0.0 0 0.0 2 124.0 0 0.0 0.0 0 0 0.0 0.0 3

cat_features = ['PuaMode']  # defined in the kernel but not used in the code shown here
new_test = pd.read_csv(
    '../input/malware-feature-engineering-full-train-and-test/new_test.csv',
    nrows=1000000, usecols=columns_to_use)
print(new_test.shape)

Output:
(1000000, 80)

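# Label each row by its source: 0 = training set, 1 = test set.
new_train['target'] = 0
new_test['target'] = 1

# Stack both sets, then separate the adversarial label from the features.
new_train = pd.concat([new_train, new_test], axis=0)
target = new_train['target'].values
del new_train['target']
del new_test

# Hold out 20% of the combined rows to evaluate the adversarial classifier.
new_train, new_val, target_train, target_val = train_test_split(
    new_train, target, test_size=0.2, random_state=42)

param = {
    'num_leaves': 200,
    'min_data_in_leaf': 60,
    'objective': 'binary',
    'max_depth': -1,
    'learning_rate': 0.1,
    "min_child_samples": 20,  # alias of min_data_in_leaf, so it conflicts with the 60 above
    "boosting": "gbdt",
    "feature_fraction": 0.8,
    "bagging_freq": 1,
    "bagging_fraction": 0.8,
    "bagging_seed": 17,
    "metric": 'auc',
    "lambda_l1": 0.1,
    "verbosity": -1,
    "n_jobs": -1,
}

new_train = lgb.Dataset(new_train.values, label=target_train)
new_val = lgb.Dataset(new_val.values, label=target_val)

num_round = 1000
# The original kernel passed verbose_eval=10 and early_stopping_rounds=25 to lgb.train.
# LightGBM 4.0 removed those arguments, so the equivalent callbacks are used here.
clf = lgb.train(param, new_train, num_round,
                valid_sets=[new_train, new_val],
                callbacks=[lgb.early_stopping(stopping_rounds=25),
                           lgb.log_evaluation(period=10)])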

Training until validation scores don't improve for 25 rounds.
[10] training's auc: 0.977506 valid_1's auc: 0.977521
[20] training's auc: 0.978298 valid_1's auc: 0.978195
[30] training's auc: 0.978955 valid_1's auc: 0.978624
[40] training's auc: 0.979589 valid_1's auc: 0.979024
[50] training's auc: 0.980195 valid_1's auc: 0.979331
[60] training's auc: 0.980738 valid_1's auc: 0.979562
[70] training's auc: 0.981254 valid_1's auc: 0.979729
[80] training's auc: 0.981701 valid_1's auc: 0.979824
[90] training's auc: 0.982138 valid_1's auc: 0.979934
[100] training's auc: 0.982507 valid_1's auc: 0.979991
[110] training's auc: 0.98287 valid_1's auc: 0.980026
[120] training's auc: 0.983184 valid_1's auc: 0.980058
[130] training's auc: 0.98349 valid_1's auc: 0.980061
[140] training's auc: 0.983802 valid_1's auc: 0.980066
[150] training's auc: 0.984118 valid_1's auc: 0.980061
[160] training's auc: 0.984421 valid_1's auc: 0.980064
Early stopping, best iteration is:
[136] training's auc: 0.983674 valid_1's auc: 0.980071

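Once the adversarial classifier is trained, we can rank the original features by their importance to it, that is, by how much each one helps distinguish the training data from test.csv, and plot that ranking.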


import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Pair each importance with its feature name. This assumes columns_to_use is in the
# same order as the CSV columns, because read_csv(usecols=...) keeps file order.
feature_imp = pd.DataFrame(
    sorted(zip(clf.feature_importance(), columns_to_use), reverse=True),
    columns=['Value', 'Feature'])

plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature",
            data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('LightGBM Features (avg over folds)')
plt.tight_layout()
# Save before show(): saving afterwards can write an empty image.
plt.savefig('lgbm_importances-01.png')
plt.show()

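Finally, we can use this feature importance ranking to choose which samples go into the validation set.

How to use the ranking:
The raw data contains many missing values, and many values are not named consistently (for example, string-valued features). So which samples should we choose? This is where the ranking helps.
For example, the top-ranked feature is 'AvSigVersion'. We remove every sample whose value for this feature is missing and pick the validation set from the remaining samples.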
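
The filtering step in this example can be sketched as follows. This is a minimal illustration under stated assumptions: rows are plain dictionaries with made-up values, missing values appear as `None` or NaN, and the validation rows are drawn uniformly at random from the complete rows (the text does not say how they are picked). The helper names `is_missing` and `pick_validation_rows` are hypothetical.

```python
import math
import random


def is_missing(value):
    """Treat None and float NaN as missing (an assumption about the encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def pick_validation_rows(rows, feature, val_fraction=0.2, seed=42):
    """Drop rows whose `feature` is missing, then sample validation indices
    uniformly at random from the rows that remain."""
    complete = [i for i, row in enumerate(rows)
                if not is_missing(row.get(feature))]
    if not complete:
        return []
    n_val = max(1, int(len(complete) * val_fraction))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    return sorted(rng.sample(complete, n_val))


# Made-up rows; only the column names come from the article.
rows = [
    {"AvSigVersion": 3.0, "OsBuild": 0},
    {"AvSigVersion": float("nan"), "OsBuild": 1},
    {"AvSigVersion": 7.0, "OsBuild": 0},
    {"AvSigVersion": None, "OsBuild": 2},
    {"AvSigVersion": 1.0, "OsBuild": 0},
    {"AvSigVersion": 4.0, "OsBuild": 1},
]
val_idx = pick_validation_rows(rows, "AvSigVersion", val_fraction=0.5)
print(val_idx)  # two indices drawn from rows 0, 2, 4 and 5
```

Rows 1 and 3 can never be selected because their 'AvSigVersion' value is missing. With a pandas DataFrame the same filter is usually written with `notna()` on the column.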
