Random Forest Variable Importance in Python: Evaluating Feature Importance with Random Forests

Published: 2023/12/8

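Preface

A random forest is an ensemble learning algorithm that uses decision trees as its base learners. Random forests are simple, easy to implement, and cheap to compute, and yet they perform remarkably well on both classification and regression. For this reason the random forest has been called "a method that represents the state of the art in ensemble learning".

This article gives a brief introduction to how random forests can be used for feature selection.

A brief introduction to random forests (RF)

Once you understand the decision tree algorithm, random forests are easy to follow. The algorithm can be summarized in the following steps: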

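1. Draw n samples from the sample set with replacement (bootstrap) to form a training set.

2. Grow a decision tree from the sampled training set. At each node of the tree:

   a. Randomly select d features without repetition.

   b. Split the samples on each of these d features and pick the best splitting feature (judged by the Gini index, the gain ratio, or the information gain).

3. Repeat steps 1 and 2 a total of k times, where k is the number of decision trees in the random forest.

4. Use the trained random forest to predict the test samples, deciding the result by majority vote (for regression, the trees' outputs are averaged instead).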
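The four steps above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not how sklearn implements RandomForestClassifier: `fit_forest` and `predict_forest` are hypothetical helper names, k=25 and d="sqrt" are arbitrary choices, and the data comes from sklearn's bundled copy of the Wine data (`load_wine`) so that the sketch runs offline. Note also that sklearn's own forest averages the trees' predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, k=25, d="sqrt", seed=0):
    """Steps 1-3: grow k trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(k):
        # Step 1: draw n row indices with replacement (bootstrap).
        rows = rng.integers(0, n, size=n)
        # Step 2: max_features=d makes the tree look at a random subset
        # of features at every node when searching for the best split.
        tree = DecisionTreeClassifier(
            max_features=d, random_state=int(rng.integers(1 << 31))
        )
        trees.append(tree.fit(X[rows], y[rows]))
    return trees


def predict_forest(trees, X):
    """Step 4: majority vote over the individual tree predictions."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # (k, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])


X, y = load_wine(return_X_y=True)  # labels here are 0, 1, 2
trees = fit_forest(X, y)
pred = predict_forest(trees, X)
print("training accuracy:", (pred == y).mean())
```

The vote uses `np.bincount`, which assumes non-negative integer class labels; other label types would need a different tally.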

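The figure below gives an intuitive picture of the random forest algorithm:

Figure 1: Schematic of the random forest algorithm

That's right: this algorithm, which picks things at random at every turn, works extremely well on classification and regression. Almost inexplicably powerful, isn't it?

That is not the focus of this article, though. The focus is what comes next: evaluating feature importance.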

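Feature importance evaluation

sklearn has already wrapped everything up for us, so all we need to do is call its functions. We use the Wine dataset from UCI as the example. First, load the dataset.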

import pandas as pd

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
df = pd.read_csv(url, header=None)
df.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
              'Alcalinity of ash', 'Magnesium', 'Total phenols',
              'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
              'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']

Next, let's take a rough look at what kind of dataset this is:

import numpy as np

np.unique(df['Class label'])

Output:

array([1, 2, 3], dtype=int64)

So there are 3 classes. Now let's look at the summary of the data:

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
Class label                     178 non-null int64
Alcohol                         178 non-null float64
Malic acid                      178 non-null float64
Ash                             178 non-null float64
Alcalinity of ash               178 non-null float64
Magnesium                       178 non-null int64
Total phenols                   178 non-null float64
Flavanoids                      178 non-null float64
Nonflavanoid phenols            178 non-null float64
Proanthocyanins                 178 non-null float64
Color intensity                 178 non-null float64
Hue                             178 non-null float64
OD280/OD315 of diluted wines    178 non-null float64
Proline                         178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB

So apart from the class label there are 13 features, and the dataset contains 178 samples.
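If the UCI URL is unreachable, one possible fallback is the copy of the same Wine data that ships with sklearn. The sketch below is an assumption on my part rather than part of the original workflow: `df_local` is a hypothetical name, the bundled column names are spelled differently from the ones assigned above (for example lowercase with underscores), every feature column is stored as float, and the classes are numbered 0, 1, 2, so 1 is added to match the labels in the UCI file.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()  # bundled copy, no network access needed

# Build a frame laid out like the article's df: label first, then 13 features.
df_local = pd.DataFrame(wine.data, columns=wine.feature_names)
df_local.insert(0, 'Class label', wine.target + 1)  # shift 0,1,2 to 1,2,3

print(df_local.shape)
labels, counts = np.unique(df_local['Class label'], return_counts=True)
print(labels, counts)
```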

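As usual, we split the dataset into a training set and a test set. Note that the sklearn.cross_validation module was deprecated in version 0.18 (and removed in 0.20); all of its refactored classes and functions were moved to the model_selection module. So we import train_test_split from sklearn.model_selection: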

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Column 0 is the class label; the remaining 13 columns are the features
x, y = df.iloc[:, 1:].values, df.iloc[:, 0].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
feat_labels = df.columns[1:]

forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
forest.fit(x_train, y_train)
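Before reading anything into the importances, it can be worth checking that the forest actually predicts well on the held-out data. The sketch below is an optional sanity check, not a step from the original workflow. It assumes sklearn's bundled `load_wine` copy of the data (so it runs offline) and uses 200 trees instead of 10000 purely to keep it fast, so the numbers will not match a run of the code above exactly.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fewer trees than in the article, only to keep the check quick.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(x_train, y_train)

# score() returns the mean accuracy on the given data and labels.
train_acc = forest.score(x_train, y_train)
test_acc = forest.score(x_test, y_test)
print("train accuracy: %.3f, test accuracy: %.3f" % (train_acc, test_acc))
```

A large gap between the two numbers would suggest overfitting, in which case the importances deserve less trust.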

Good, the random forest is now trained, and the feature importances were computed along the way. Let's pull them out and take a look.

importances = forest.feature_importances_
# Feature indices sorted from most to least important
indices = np.argsort(importances)[::-1]
for f in range(x_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))

The output is:

 1) Color intensity                0.182483
 2) Proline                        0.158610
 3) Flavanoids                     0.150948
 4) OD280/OD315 of diluted wines   0.131987
 5) Alcohol                        0.106589
 6) Hue                            0.078243
 7) Total phenols                  0.060718
 8) Alcalinity of ash              0.032033
 9) Malic acid                     0.025400
10) Proanthocyanins                0.022351
11) Magnesium                      0.022078
12) Nonflavanoid phenols           0.014645
13) Ash                            0.013916
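To make the `feature_importances_` numbers less of a black box: in the sklearn versions I am aware of, the forest's value is the average of the impurity-based importances of its individual trees, normalized to sum to 1. The sketch below checks this numerically. It assumes the bundled `load_wine` copy of the data and a 200-tree forest, so the values differ from the table above; `per_tree` and `manual` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# One row of importances per fitted tree: shape (n_trees, n_features).
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over the trees, then renormalize so the result sums to 1.
manual = per_tree.mean(axis=0)
manual /= manual.sum()

print(np.allclose(manual, forest.feature_importances_))
print(per_tree.std(axis=0))  # spread of each feature's importance across trees
```

The per-tree standard deviation gives a rough sense of how stable each ranking is: two features whose importances differ by less than their spread should not be treated as clearly ordered.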

Yes, it really is that convenient.

To keep only the variables with relatively high importance, you can do the following:

threshold = 0.15
x_selected = x_train[:, importances > threshold]
x_selected.shape

Output:

(124, 3)

This way, we have selected the 3 features whose importance is greater than 0.15.
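The same selection can also be expressed with sklearn's SelectFromModel, which has the practical advantage that the identical column mask is applied to the test set. This is an alternative sketch, not the method used above. It assumes the bundled `load_wine` data and a 200-tree forest, so the number of columns kept may differ from 3. One detail to be aware of: SelectFromModel keeps features whose importance is greater than or equal to the threshold, whereas the boolean mask above uses a strict `>`.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(x_train, y_train)

threshold = 0.15
# prefit=True reuses the already trained forest instead of refitting it.
selector = SelectFromModel(forest, threshold=threshold, prefit=True)

# Apply the same column mask to both splits.
x_train_sel = selector.transform(x_train)
x_test_sel = selector.transform(x_test)
print(x_train_sel.shape, x_test_sel.shape)
print(selector.get_support())  # boolean mask of the kept features
```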
