
Ranking Feature Importance with an XGBoost Model

Published: 2023/12/8 | Programming Q&A | 豆豆


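In this article, you will learn:

How XGBoost ranks feature importance for a predictive model (that is, why XGBoost is able to produce such a ranking).

How to plot the feature importances computed by an XGBoost model as a bar chart.

How to use the feature importances from an XGBoost model to perform feature selection in scikit-learn.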

How does gradient boosting calculate feature importance?

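A benefit of gradient boosting is that, once the boosted trees have been constructed, it is relatively straightforward to retrieve an importance score for each attribute. Generally, an importance score indicates how valuable a feature was in the construction of the boosted decision trees within the model. The more an attribute is used to build the decision trees, the higher its relative importance.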

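Importance is calculated explicitly for each attribute in the dataset, which allows the attributes to be ranked and compared. Within a single decision tree, an attribute's importance is the amount by which each of its split points improves the performance measure, weighted by the number of observations the node is responsible for. In other words, the more a split on an attribute improves the performance measure (splits closer to the root tend to improve it more and cover more observations), the larger its weight, and the more boosted trees select the attribute, the more important it is. The performance measure may be the Gini index (purity) used to select split points, or another measure.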

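Finally, an attribute's weighted results are summed and then averaged across all boosted trees in the model to give its importance score.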
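The calculation described above can be sketched on a toy ensemble. This is only an illustration of the weighting and averaging idea, not XGBoost's internal code; the trees, split improvements, and observation counts below are invented, and the function names are hypothetical. XGBoost itself offers several importance types (for example, ones based on split counts or on gain), so its reported scores will not generally match this sketch.

```python
from collections import defaultdict

# Hypothetical toy ensemble (invented numbers). Each tree is a list of splits:
# (feature_index, improvement_in_performance_measure, observations_at_node).
toy_trees = [
    [(0, 0.40, 100), (1, 0.10, 60), (0, 0.05, 40)],
    [(1, 0.30, 100), (2, 0.20, 50)],
]


def tree_importance(splits):
    """Sum each feature's split improvement, weighted by observations at the node."""
    scores = defaultdict(float)
    for feature, improvement, n_obs in splits:
        scores[feature] += improvement * n_obs
    return scores


def ensemble_importance(trees, n_features):
    """Average the per-tree scores over all trees, then normalize to sum to 1."""
    totals = [0.0] * n_features
    for splits in trees:
        for feature, score in tree_importance(splits).items():
            totals[feature] += score
    averaged = [total / len(trees) for total in totals]
    norm = sum(averaged)
    return [value / norm for value in averaged]


importance = ensemble_importance(toy_trees, n_features=3)
for index, score in enumerate(importance):
    print("f%d: %.4f" % (index, score))
```

In this toy data, feature 0 ranks first because its splits sit on large nodes with large improvements, even though feature 1 is used in both trees.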

Plotting feature importance

A trained XGBoost model automatically calculates feature importance. The scores are available in the feature_importances_ member variable and can be printed as follows:

print(model.feature_importances_)

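We can plot these scores directly on a bar chart to get a visual indication of the relative importance of each feature in the dataset. For example: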

# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

We can demonstrate this by training an XGBoost model on the Pima Indians onset of diabetes dataset and creating a bar chart from the calculated feature importances.

# plot feature importance manually
from numpy import loadtxt
from xgboost import XGBClassifier
from matplotlib import pyplot

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X (first 8 columns) and y (last column)
X = dataset[:, 0:8]
y = dataset[:, 8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# feature importance
print(model.feature_importances_)
# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

Running this example first prints the feature importance scores:

[0.089701, 0.17109634, 0.08139535, 0.04651163, 0.10465116, 0.2026578, 0.1627907, 0.14119601]

It then shows a bar chart of the relative importances.

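A drawback of this plot is that the features appear in input-index order rather than ranked by importance. The scores can be sorted before plotting.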
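One way to sort the scores before plotting is with numpy's argsort. The sketch below reuses the scores printed above in place of model.feature_importances_; the f0 to f7 labels are simply index-based names chosen for this illustration.

```python
import numpy as np

# Importance scores copied from the printed output above
# (a stand-in for model.feature_importances_).
scores = np.array([0.089701, 0.17109634, 0.08139535, 0.04651163,
                   0.10465116, 0.2026578, 0.1627907, 0.14119601])

# argsort returns ascending order; reverse it for most important first.
order = np.argsort(scores)[::-1]
labels = ["f%d" % i for i in order]
sorted_scores = scores[order]

for label, score in zip(labels, sorted_scores):
    print("%s: %.4f" % (label, score))
```

The sorted labels and scores could then be passed to pyplot.bar(labels, sorted_scores) in place of the unsorted call.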

Alternatively, the XGBoost library provides a built-in function, plot_importance(), that plots the features ordered by importance. For example:

# plot feature importance using built-in function
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X (first 8 columns) and y (last column)
X = dataset[:, 0:8]
y = dataset[:, 8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()

Running the example produces the bar chart.

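Features are automatically named f0 to f7 according to their index in the input array. Mapping these indices manually to the names in the problem description, we can see that f5 (body mass index) has the highest importance and f3 (skin fold thickness) has the lowest.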
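The manual mapping from index-based names to descriptive names can be kept in a small dictionary. The sketch below assumes the usual column order of the Pima Indians diabetes dataset and reuses the feature_importances_ values printed earlier as stand-in scores; the plot produced by plot_importance() may use a different scoring scale.

```python
# Assumed column order of the Pima Indians diabetes dataset.
feature_names = {
    "f0": "number of times pregnant",
    "f1": "plasma glucose concentration",
    "f2": "diastolic blood pressure",
    "f3": "skin fold thickness",
    "f4": "2-hour serum insulin",
    "f5": "body mass index",
    "f6": "diabetes pedigree function",
    "f7": "age",
}

# Scores copied from the printed feature_importances_ output above.
scores = [0.089701, 0.17109634, 0.08139535, 0.04651163,
          0.10465116, 0.2026578, 0.1627907, 0.14119601]

# Pair each descriptive name with its score.
named_scores = {feature_names["f%d" % i]: s for i, s in enumerate(scores)}

most_important = max(named_scores, key=named_scores.get)
least_important = min(named_scores, key=named_scores.get)
print("highest:", most_important)
print("lowest:", least_important)
```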

Feature selection with XGBoost feature importance scores

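Feature importance scores can be used for feature selection in scikit-learn. This is done with the SelectFromModel class, which takes a model and transforms a dataset into a subset containing only the selected features. The class can take a pre-trained model, such as one trained on the entire dataset. It then uses a threshold to decide which features to select. The threshold is applied whenever transform() is called on the SelectFromModel instance, so the same features are selected consistently on the training and test sets.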

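In the example below, we first train an XGBoost model on the training set and evaluate it on the test set. Using the feature importances calculated from the training dataset, we then wrap the model in a SelectFromModel instance. We use this to select features on the training set, train a model on the selected subset of features, and then evaluate that model on the test set using the same feature selection.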

For example:

# select features using threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
# eval model
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)

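We can test multiple thresholds for selecting features by importance. Specifically, using each input variable's importance as a threshold lets us test every subset of features ranked by importance, starting with all features and ending with only the most important one.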
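To see why each importance value used as a threshold yields one fewer feature, the selection rule can be written out with a boolean mask: features whose importance is greater than or equal to the threshold are kept. The sketch below is a plain numpy illustration of that rule, not scikit-learn's implementation; select_columns and X_demo are hypothetical names, and the importances are the scores printed earlier.

```python
import numpy as np

# Stand-in importances, copied from the printed output earlier in the article.
importances = np.array([0.089701, 0.17109634, 0.08139535, 0.04651163,
                        0.10465116, 0.2026578, 0.1627907, 0.14119601])


def select_columns(X, importances, threshold):
    """Keep the columns whose importance is >= threshold."""
    mask = importances >= threshold
    return X[:, mask]


# Tiny placeholder matrix with 2 rows and 8 columns (values are arbitrary).
X_demo = np.arange(16, dtype=float).reshape(2, 8)

counts = []
for thresh in np.sort(importances):
    selected = select_columns(X_demo, importances, thresh)
    counts.append(selected.shape[1])
    print("Thresh=%.3f, n=%d" % (thresh, selected.shape[1]))
```

Because the thresholds are taken in ascending order, the smallest one keeps all eight features and the largest keeps only the single most important feature.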

The complete example is as follows:

# use feature importance for feature selection
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:, 0:8]
Y = dataset[:, 8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

Running the example produces the following output:

Accuracy: 77.95%
Thresh=0.071, n=8, Accuracy: 77.95%
Thresh=0.073, n=7, Accuracy: 76.38%
Thresh=0.084, n=6, Accuracy: 77.56%
Thresh=0.090, n=5, Accuracy: 76.38%
Thresh=0.128, n=4, Accuracy: 76.38%
Thresh=0.160, n=3, Accuracy: 74.80%
Thresh=0.186, n=2, Accuracy: 71.65%
Thresh=0.208, n=1, Accuracy: 63.78%

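We can see that the model's performance generally decreases as the number of selected features decreases. On this problem there is a trade-off between test-set accuracy and model complexity: for example, we could choose a model with 4 features and accept a drop in accuracy from 77.95% to 76.38%. On such a small dataset this is likely to be a wash, but it may be a more useful strategy on a larger dataset, with cross-validation as the model evaluation scheme.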
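The trade-off can be made explicit by picking the smallest feature subset whose accuracy stays within a chosen tolerance of the full model. The sketch below uses the results printed above; the 2-point tolerance and the smallest_subset helper are assumptions made for illustration, not part of the original example.

```python
# (number of features, test accuracy in %) copied from the output above.
results = [(8, 77.95), (7, 76.38), (6, 77.56), (5, 76.38),
           (4, 76.38), (3, 74.80), (2, 71.65), (1, 63.78)]

baseline = 77.95  # accuracy of the model trained on all 8 features


def smallest_subset(results, baseline, tolerance):
    """Return the fewest features whose accuracy is within `tolerance` points of baseline."""
    acceptable = [n for n, accuracy in results if baseline - accuracy <= tolerance]
    return min(acceptable)


# Assumed tolerance: accept losing at most 2 percentage points of accuracy.
chosen = smallest_subset(results, baseline, tolerance=2.0)
print("Smallest acceptable subset: %d features" % chosen)
```

With this tolerance the rule arrives at the same 4-feature model mentioned above; a stricter tolerance would keep more features.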
