"7th-place-solution-microsoft-malware-prediction": the 7th-place code from Kaggle's Microsoft Malware Prediction competition

Published: 2025/3/15

Code source: the Kaggle kernel "7th-place-solution-microsoft-malware-prediction"

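Preface

Reading other people's good code helps you write better code yourself. You pick up programming knowledge, good habits, and the occasional clever trick. This post collects the blogger's personal notes and opinions on the 7th-place solution to Microsoft's 2019 malware prediction competition. Some of the code already carries the original author's comments, and the blogger has added further comments of their own. The blogger's ability is limited, so mistakes are inevitable; corrections from fellow enthusiasts are welcome.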

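Main text

Overview

Building a machine-learning classifier consists of two main parts: 1. data preprocessing (data cleaning, feature engineering, and so on); 2. model building (training and hyperparameter tuning). Preprocessing is the groundwork for the model, and the quality of the training data largely determines the quality of the final model, so in a typical machine-learning project most of the code deals with data. This code is no exception. In the blogger's opinion its data processing is not outstanding, but it is passable (for more interesting preprocessing code, see another of the blogger's posts). The machine-learning algorithm used here is LightGBM.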

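Code walkthrough

Note:
The blogger explains the code piece by piece but, for hardware reasons, cannot show the output of each step. Readers who have the resources can copy the code and run it themselves; the data files can be downloaded from the competition's official page. Although the explanation proceeds step by step, joining the pieces from top to bottom gives the complete code.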

Data preprocessing

Importing libraries

# imports
import numpy as np
import pandas as pd
import gc  # Python's garbage-collection interface
import time  # does not appear to be used in this code
import random  # random numbers
from lightgbm import LGBMClassifier  # LightGBM classifier
from sklearn.metrics import roc_auc_score, roc_curve  # AUC / ROC, a measure of a classifier's discriminative ability
from sklearn.model_selection import StratifiedKFold  # stratified split into training and validation folds
import matplotlib.pyplot as plot  # visualization
import seaborn as sb  # visualization

Preparation before the real work

# vars
dataFolder = '../input/'
submissionFileName = 'submission'
trainFile = 'train.csv'
testFile = 'test.csv'
# used 4000000 nr of rows instead of 8000000 because of Kernel memory issue
numberOfRows = 4000000

seed = 6001
np.random.seed(seed)
random.seed(seed)

def displayImportances(featureImportanceDf, submissionFileName):
    # Rank the features by mean importance (descending) and keep the ranked feature names in cols
    cols = featureImportanceDf[["feature", "importance"]].groupby("feature").mean().sort_values(by = "importance", ascending = False).index
    # .loc accepts a boolean mask as well as labels; the mask is evaluated row by row
    bestFeatures = featureImportanceDf.loc[featureImportanceDf.feature.isin(cols)]  # isin() takes a list-like and returns one boolean per element
    plot.figure(figsize = (14, 14))
    sb.barplot(x = "importance", y = "feature", data = bestFeatures.sort_values(by = "importance", ascending = False))
    plot.title('LightGBM Features')
    plot.tight_layout()
    plot.savefig(submissionFileName + '.png')

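In the blogger's view there is no real need to hold the paths in several variables (perhaps it is simply the code author's habit). The role of numberOfRows = 4000000 only becomes clear from the code as a whole: only the first 4,000,000 rows of train.csv are read (the author's comment cites kernel memory limits), those rows are stacked on top of the test set for preprocessing, and the first 4,000,000 rows of the combined frame are then taken back out as the training data, which cross-validation later divides into training and validation folds. seed = 6001 and the two lines after it fix the random seeds. Both calls are needed because np.random.seed(seed) seeds NumPy's global generator while random.seed(seed) seeds Python's built-in random module, and the two keep separate state. The custom function displayImportances is used at the very end to plot the feature importances and save the chart as a PNG.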
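To see why the code seeds both generators, here is a minimal sketch (the variable names are made up for illustration and are not part of the kernel). It suggests that re-seeding NumPy alone does not rewind Python's `random` module:

```python
import random

import numpy as np

seed = 6001

# Seed both generators and take one draw from each.
random.seed(seed)
np.random.seed(seed)
first_py = random.random()
first_np = np.random.rand()

# Re-seeding both reproduces both draws.
random.seed(seed)
np.random.seed(seed)
second_py = random.random()
second_np = np.random.rand()

# Re-seeding only NumPy rewinds NumPy's stream; Python's stream keeps going.
np.random.seed(seed)
third_py = random.random()
third_np = np.random.rand()

print(first_py == second_py, first_np == second_np)
print(third_py == first_py, third_np == first_np)
```

Any library that draws from either generator (shuffling, sampling, and so on) then behaves the same way on every run, which is presumably why the author seeds both.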

Setting a type for each feature in the official files
The raw data contains only the values; the organizers do not say what type each column is, so you have to set the types yourself.

dtypes = {
    'MachineIdentifier': 'category',
    'ProductName': 'category',
    'EngineVersion': 'category',
    'AppVersion': 'category',
    'AvSigVersion': 'category',
    'IsBeta': 'int8',
    'RtpStateBitfield': 'float16',
    'IsSxsPassiveMode': 'int8',
    'DefaultBrowsersIdentifier': 'float16',
    'AVProductStatesIdentifier': 'float32',
    'AVProductsInstalled': 'float16',
    'AVProductsEnabled': 'float16',
    'HasTpm': 'int8',
    'CountryIdentifier': 'int16',
    'CityIdentifier': 'float32',
    'OrganizationIdentifier': 'float16',
    'GeoNameIdentifier': 'float16',
    'LocaleEnglishNameIdentifier': 'int8',
    'Platform': 'category',
    'Processor': 'category',
    'OsVer': 'category',
    'OsBuild': 'int16',
    'OsSuite': 'int16',
    'OsPlatformSubRelease': 'category',
    'OsBuildLab': 'category',
    'SkuEdition': 'category',
    'IsProtected': 'float16',
    'AutoSampleOptIn': 'int8',
    'PuaMode': 'category',
    'SMode': 'float16',
    'IeVerIdentifier': 'float16',
    'SmartScreen': 'category',
    'Firewall': 'float16',
    'UacLuaenable': 'float32',
    'Census_MDC2FormFactor': 'category',
    'Census_DeviceFamily': 'category',
    'Census_OEMNameIdentifier': 'float16',
    'Census_OEMModelIdentifier': 'float32',
    'Census_ProcessorCoreCount': 'float16',
    'Census_ProcessorManufacturerIdentifier': 'float16',
    'Census_ProcessorModelIdentifier': 'float16',
    'Census_ProcessorClass': 'category',
    'Census_PrimaryDiskTotalCapacity': 'float32',
    'Census_PrimaryDiskTypeName': 'category',
    'Census_SystemVolumeTotalCapacity': 'float32',
    'Census_HasOpticalDiskDrive': 'int8',
    'Census_TotalPhysicalRAM': 'float32',
    'Census_ChassisTypeName': 'category',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float16',
    'Census_InternalPrimaryDisplayResolutionHorizontal': 'float16',
    'Census_InternalPrimaryDisplayResolutionVertical': 'float16',
    'Census_PowerPlatformRoleName': 'category',
    'Census_InternalBatteryType': 'category',
    'Census_InternalBatteryNumberOfCharges': 'float32',
    'Census_OSVersion': 'category',
    'Census_OSArchitecture': 'category',
    'Census_OSBranch': 'category',
    'Census_OSBuildNumber': 'int16',
    'Census_OSBuildRevision': 'int32',
    'Census_OSEdition': 'category',
    'Census_OSSkuName': 'category',
    'Census_OSInstallTypeName': 'category',
    'Census_OSInstallLanguageIdentifier': 'float16',
    'Census_OSUILocaleIdentifier': 'int16',
    'Census_OSWUAutoUpdateOptionsName': 'category',
    'Census_IsPortableOperatingSystem': 'int8',
    'Census_GenuineStateName': 'category',
    'Census_ActivationChannel': 'category',
    'Census_IsFlightingInternal': 'float16',
    'Census_IsFlightsDisabled': 'float16',
    'Census_FlightRing': 'category',
    'Census_ThresholdOptIn': 'float16',
    'Census_FirmwareManufacturerIdentifier': 'float16',
    'Census_FirmwareVersionIdentifier': 'float32',
    'Census_IsSecureBootEnabled': 'int8',
    'Census_IsWIMBootEnabled': 'float16',
    'Census_IsVirtualDevice': 'float16',
    'Census_IsTouchEnabled': 'int8',
    'Census_IsPenCapable': 'int8',
    'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',
    'Wdft_IsGamer': 'float16',
    'Wdft_RegionIdentifier': 'float16',
    'HasDetections': 'int8'
}
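A likely motivation for spelling out the types is memory: left to itself, pandas reads integers as int64 and text as generic objects. The sketch below uses a tiny made-up CSV (only the three column names are borrowed from the competition data) to compare the footprint with and without an explicit `dtype` mapping:

```python
import io

import pandas as pd

# Made-up data: 1000 rows, three columns named after competition features.
csv_text = "IsBeta,OsBuild,Processor\n" + "0,17134,x64\n1,16299,x86\n" * 500

# Let pandas infer the types.
default_df = pd.read_csv(io.StringIO(csv_text))

# Read the same text with compact, explicit types.
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"IsBeta": "int8", "OsBuild": "int16", "Processor": "category"},
)

default_bytes = default_df.memory_usage(deep=True).sum()
typed_bytes = typed_df.memory_usage(deep=True).sum()

print(default_df.dtypes)
print(typed_df.dtypes)
print("inferred:", default_bytes, "bytes; explicit:", typed_bytes, "bytes")
```

On millions of rows and dozens of columns the same effect decides whether the data fits in a Kaggle kernel at all, which fits the author's remark about memory limits.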

Feature selection

selectedFeatures = [
    'AVProductStatesIdentifier',
    'AVProductsEnabled',
    'IsProtected',
    'Processor',
    'OsSuite',
    'IsProtected',
    'RtpStateBitfield',
    'AVProductsInstalled',
    'Wdft_IsGamer',
    'DefaultBrowsersIdentifier',
    'OsBuild',
    'Wdft_RegionIdentifier',
    'SmartScreen',
    'CityIdentifier',
    'AppVersion',
    'Census_IsSecureBootEnabled',
    'Census_PrimaryDiskTypeName',
    'Census_SystemVolumeTotalCapacity',
    'Census_HasOpticalDiskDrive',
    'Census_IsWIMBootEnabled',
    'Census_IsVirtualDevice',
    'Census_IsTouchEnabled',
    'Census_FirmwareVersionIdentifier',
    'GeoNameIdentifier',
    'IeVerIdentifier',
    'Census_FirmwareManufacturerIdentifier',
    'Census_InternalPrimaryDisplayResolutionHorizontal',
    'Census_InternalPrimaryDisplayResolutionVertical',
    'Census_OEMModelIdentifier',
    'Census_ProcessorModelIdentifier',
    'Census_OSVersion',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches',
    'Census_OEMNameIdentifier',
    'Census_ChassisTypeName',
    'Census_OSInstallLanguageIdentifier',
    'EngineVersion',
    'OrganizationIdentifier',
    'CountryIdentifier',
    'Census_ActivationChannel',
    'Census_ProcessorCoreCount',
    'Census_OSWUAutoUpdateOptionsName',
    'Census_InternalBatteryType'
]

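The code does not show how these features were chosen. The code author presumably has very deep data-processing expertise and may have picked them directly from earlier experience with malware data. The lesson is that features should not be chosen at random: without that kind of experience, it is better to borrow other people's preprocessing methods for feature selection.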

Loading the data

# Load Data with selected features
trainDf = pd.read_csv(dataFolder + trainFile, dtype=dtypes, usecols=selectedFeatures, low_memory=True, nrows = numberOfRows)  # training set
labels = pd.read_csv(dataFolder + trainFile, usecols = ['HasDetections'], nrows = numberOfRows)  # labels
testDf = pd.read_csv(dataFolder + testFile, dtype=dtypes, usecols=selectedFeatures, low_memory=True)  # test set

print('== Dataset Shapes ==')
print('Train : ' + str(trainDf.shape))  # trainDf.shape is a tuple
print('Labels : ' + str(labels.shape))
print('Test : ' + str(testDf.shape))

# Append Datasets and Cleanup
# Stack train on top of test. reset_index() adds a new 'index' column that keeps the original row labels.
# The original kernel called trainDf.append(testDf); DataFrame.append was removed in pandas 2.0, so pd.concat is used here.
df = pd.concat([trainDf, testDf]).reset_index()
del trainDf, testDf  # drop trainDf and testDf to save memory
gc.collect()

df is the new DataFrame obtained by stacking train on top of test.
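The stacking step can be tried on two toy frames. This is only a sketch with made-up values; `train_part` and `test_part` are hypothetical names. It shows the extra 'index' column that `reset_index()` creates, and how the two parts can be separated again by row position:

```python
import pandas as pd

train_part = pd.DataFrame({"SmartScreen": ["Off", "on"]})
test_part = pd.DataFrame({"SmartScreen": ["Warn", "off", "Block"]})

# Vertical concatenation; reset_index() moves the old row labels into a column named 'index'.
stacked = pd.concat([train_part, test_part]).reset_index()
print(stacked)

# Because train comes first, the first len(train_part) rows are the training rows.
n_train = len(train_part)
train_again = stacked.iloc[:n_train]
test_again = stacked.iloc[n_train:]
print(train_again.shape, test_again.shape)
```

Preprocessing the stacked frame once keeps train and test consistent, at the cost of letting test-set statistics influence the encoded training features.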

Cleaning up the values of the 'SmartScreen' feature

# Modify SmartScreen Feature
df.loc[df.SmartScreen == 'off', 'SmartScreen'] = 'Off'  # df.SmartScreen == 'off' is the row condition
df.loc[df.SmartScreen == 'of', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'OFF', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == '00000000', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == '0', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'ON', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'on', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'Enabled', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'BLOCK', 'SmartScreen'] = 'Block'
df.loc[df.SmartScreen == 'requireadmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'requireAdmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'RequiredAdmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'Promt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'Promprt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'prompt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'warn', 'SmartScreen'] = 'Warn'
df.loc[df.SmartScreen == 'Deny', 'SmartScreen'] = 'Block'
df.loc[df.SmartScreen == '', 'SmartScreen'] = 'Off'

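What we can learn here is a way to pick out specific values of a feature: write a condition (a boolean mask) that selects the rows holding the target value, then assign to them.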
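As an illustration of the boolean-mask idiom, and of one possible alternative, the sketch below normalizes a few made-up values twice: once with a `.loc` assignment in the style of the kernel, and once with a dictionary passed to `Series.replace`. The dictionary version is the blogger-style suggestion here, not something the code author used, and `canonical` holds only a subset of the spellings:

```python
import pandas as pd

smart_screen = pd.Series(["off", "OFF", "on", "Promt", "Warn", "requireadmin"])

# Boolean-mask assignment: the comparison yields one True/False per row,
# and .loc writes only to the rows where it is True.
by_mask = smart_screen.copy()
by_mask.loc[by_mask == "off"] = "Off"

# Alternative (an assumption, not the kernel's approach): one lookup table
# from each variant spelling to its canonical form.
canonical = {
    "off": "Off",
    "OFF": "Off",
    "on": "On",
    "Promt": "Prompt",
    "requireadmin": "RequireAdmin",
}
by_dict = smart_screen.replace(canonical)

print(by_mask.tolist())
print(by_dict.tolist())
```

The mask version is explicit about one value at a time; the dictionary version keeps all spellings in one place, which is easier to audit when there are eighteen of them.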

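Counting how often each value of each feature occurs, then building a new DataFrame

# Count Encoding (with exceptions)
for col in [f for f in df.columns if f not in ['index','HasDetections','Census_SystemVolumeTotalCapacity']]:
    df[col] = df[col].map(df[col].value_counts())  # replace each value in col with the number of times it occurs in col

dfDummy = pd.get_dummies(df, dummy_na=True)  # one-hot encode any columns still of object/category dtype; dummy_na=True adds a column for missing values (NaN)
print('Dummy: ' + str(dfDummy.shape))

# Split back into train and test.
# Assumption: this step is missing from the code as posted; it is reconstructed from the fact that the first numberOfRows rows of df come from train.csv.
train = dfDummy[:numberOfRows]
test = dfDummy[numberOfRows:]

# Cleanup
del df
gc.collect()

# Summary Shape
print('== Dataset Shapes ==')
print('Train: ' + str(train.shape))
print('Test: ' + str(test.shape))

# Summary Columns
print('== Dataset Columns ==')
features = [f for f in train.columns if f not in ['index']]
for feature in features:
    print(feature)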

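df[col].map(df[col].value_counts()) uses .map() to put each value's occurrence count in the position where the value itself used to be (if a function is unfamiliar, the blogger suggests looking it up; only the purpose of the code is explained here). The line is neat because a single statement replaces every value in the column with the number of times it occurs. Strictly speaking this is a count, not a frequency; the relative frequency would be the count divided by the total number of rows. Why encode values this way? Count encoding turns a categorical value, possibly one of thousands, into a single number that a tree model such as LightGBM can split on directly, without creating one column per category.

features: when train and test were stacked above, .reset_index() was called, which adds a new 'index' column holding the original row labels, so that column is excluded here with not in ['index'].

df[col]=df[col].map(df[col].value_counts())
This line is fairly tricky; it is easiest to understand by trying it on a tiny Series.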
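A small worked example, with made-up values, of what the count-encoding line does to one column. The variable names are hypothetical; only `value_counts()` and `map()` come from the code above:

```python
import pandas as pd

col = pd.Series(["x64", "x86", "x64", "arm64", "x64", "x86"])

# value_counts() returns a Series indexed by the distinct values:
# x64 occurs 3 times, x86 twice, arm64 once.
counts = col.value_counts()

# map() looks every element up in that Series and returns its count.
encoded = col.map(counts)
print(encoded.tolist())

# The relative frequency divides each count by the number of rows.
relative = col.map(col.value_counts(normalize=True))
print(relative.round(3).tolist())
```

Two different values that happen to occur equally often receive the same code, so count encoding can merge categories; that is a known trade-off of the technique.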

Building the machine-learning model

Training

# CV Folds
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = seed)

# Create arrays and dataframes to store results
oofPreds = np.zeros(train.shape[0])  # numpy.ndarray
subPreds = np.zeros(test.shape[0])  # numpy.ndarray
featureImportanceDf = pd.DataFrame()

# Loop through all Folds.
# enumerate() pairs each split with its position; there are 5 (index, split) tuples because n_splits = 5
for n_fold, (trainXId, validXId) in enumerate(folds.split(train[features], labels)):
    # Create TrainXY and ValidationXY set based on fold-indexes
    trainX, trainY = train[features].iloc[trainXId], labels.iloc[trainXId]
    validX, validY = train[features].iloc[validXId], labels.iloc[validXId]

    print('== Fold: ' + str(n_fold))  # str() is needed: concatenating a str and an int with + raises TypeError

    # LightGBM parameters
    lgbm = LGBMClassifier(
        objective = 'binary',
        boosting_type = 'gbdt',
        n_estimators = 2500,
        learning_rate = 0.05,
        num_leaves = 250,
        min_data_in_leaf = 125,
        bagging_fraction = 0.901,
        max_depth = 13,
        reg_alpha = 2.5,
        reg_lambda = 2.5,
        min_split_gain = 0.0001,
        min_child_weight = 25,
        feature_fraction = 0.5,
        silent = -1,
        verbose = -1,
        # n_jobs is set to -1 instead of 4 otherwise the kernel will time out
        n_jobs = -1)

    # The verbose and early_stopping_rounds arguments of fit() (and silent above) belong to the LightGBM releases
    # current when this kernel was written (2019); LightGBM 4.0 removed them in favour of callbacks.
    lgbm.fit(trainX, trainY, eval_set=[(trainX, trainY), (validX, validY)], eval_metric = 'auc', verbose = 250, early_stopping_rounds = 100)

    # Check the classifier by computing the AUC from the predicted positive-class probabilities and the true validation labels
    oofPreds[validXId] = lgbm.predict_proba(validX, num_iteration = lgbm.best_iteration_)[:, 1]  # probability that each validation sample is class 1 (positive)
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(validY, oofPreds[validXId])))  # AUC from validation labels and predicted probabilities

    # cleanup
    print('Cleanup')
    del trainX, trainY, validX, validY
    gc.collect()

    # Predict the test set and accumulate the positive-class probability, averaged over folds.n_splits = 5 folds
    subPreds += lgbm.predict_proba(test[features], num_iteration = lgbm.best_iteration_)[:, 1] / folds.n_splits

    # Feature Importance
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = lgbm.feature_importances_  # feature importance: the more important the feature, the larger the value
    fold_importance_df["fold"] = n_fold + 1
    featureImportanceDf = pd.concat([featureImportanceDf, fold_importance_df], axis=0)  # vertical concatenation that keeps the original index

    # cleanup
    print('Cleanup. Post-Fold')
    del lgbm
    gc.collect()

print('Full AUC score %.6f' % roc_auc_score(labels, oofPreds))  # AUC over all training samples

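1. oofPreds = np.zeros(train.shape[0]): creates an array of zeros with one element per row of train.
subPreds = np.zeros(test.shape[0]): creates an array of zeros with one element per row of test.

2. oofPreds and subPreds are numpy.ndarray objects. An ndarray can be filled through a fold's index array (oofPreds[validXId] = ...), and roc_auc_score() accepts it directly as array-like input.

3. After training, we can compute the AUC to check how well the classifier performs.

oofPreds[validXId] = lgbm.predict_proba(validX, num_iteration = lgbm.best_iteration_)[:, 1]: the predicted probability that each validation sample is class 1 (positive).
roc_auc_score(validY, oofPreds[validXId]): computes the AUC from the validation labels and the predicted positive-class probabilities.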
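The out-of-fold pattern described in these notes can be reproduced in miniature. The sketch below is not the kernel's model: it swaps LightGBM for scikit-learn's `LogisticRegression` and uses synthetic data from `make_classification`, purely so that it runs in seconds. The fold handling, the `oof_preds` array, and the averaged test prediction follow the same structure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins for the real train and test data.
X, y = make_classification(n_samples=400, n_features=8, random_state=6001)
X_test = X[:50]  # reused rows, only to demonstrate the averaging step

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=6001)
oof_preds = np.zeros(X.shape[0])       # one out-of-fold prediction per training row
sub_preds = np.zeros(X_test.shape[0])  # test prediction averaged over the folds

for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X, y)):
    model = LogisticRegression(max_iter=1000)  # stand-in for LGBMClassifier
    model.fit(X[train_idx], y[train_idx])

    # Each row is predicted exactly once, by the model that never saw it.
    oof_preds[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    sub_preds += model.predict_proba(X_test)[:, 1] / folds.n_splits

    fold_auc = roc_auc_score(y[valid_idx], oof_preds[valid_idx])
    print("Fold %2d AUC : %.6f" % (n_fold + 1, fold_auc))

full_auc = roc_auc_score(y, oof_preds)
print("Full AUC score %.6f" % full_auc)
```

Because every entry of `oof_preds` comes from a model that did not train on that row, the full AUC is an honest estimate of generalization, unlike an AUC computed on the training folds.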

Saving files and visualization (the plotting function is defined at the top of the code)

# Feature Importance
displayImportances(featureImportanceDf, submissionFileName)

# Generate Submission
kaggleSubmission = pd.read_csv(dataFolder + 'sample_submission.csv')
kaggleSubmission['HasDetections'] = subPreds
kaggleSubmission.to_csv(submissionFileName + '.csv', index = False)
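The ranking step inside displayImportances can be checked on a hand-made table. The importance numbers below are invented for illustration; the groupby, mean, and sort calls are the same ones the function uses:

```python
import pandas as pd

# One row per (feature, fold), as accumulated in featureImportanceDf.
importance_df = pd.DataFrame({
    "feature": ["AppVersion", "SmartScreen", "AppVersion", "SmartScreen"],
    "importance": [120, 300, 100, 340],
    "fold": [1, 1, 2, 2],
})

# Average each feature's importance over the folds, then sort descending.
mean_importance = (
    importance_df[["feature", "importance"]]
    .groupby("feature")
    .mean()
    .sort_values(by="importance", ascending=False)
)
ranked_features = mean_importance.index.tolist()

print(mean_importance)
print(ranked_features)
```

Averaging over folds smooths out the fold-to-fold noise in the importances before they are plotted.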
