Random Forest Models

Published: 2023/12/9 · Programming Q&A · 35 豆豆

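Introduction to ensemble models
- Bagging
- Boosting
The random forest model
- 1. Basic principles
  - 1. Random data
  - 2. Random features
Implementing random forests in code
- Case study
  - Quantitative finance: stocks
- Parameter tuning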

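Introduction to ensemble models

An ensemble model learns with a collection of weak learners (also called base models) and combines their outputs to get better results than any single learner achieves alone. The two common families are bagging and boosting. Random forest is the typical bagging algorithm, while typical boosting models include AdaBoost, GBDT, XGBoost, and LightGBM.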

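Bagging

Bagging works like a vote. Draw n training sets from the original data by random sampling with replacement, and train one weak learner on each, giving n weak learners. For classification, every weak learner votes and the majority class becomes the final prediction. For regression, the final result is the average of all learners' predictions.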
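
The procedure above can be sketched by hand. This is a minimal illustration, not how any particular library implements bagging: the toy data, the choice of 15 shallow decision trees, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy data (hypothetical): the class is 1 when the two features sum to more than 1.
X = rng.random((200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

n_learners = 15  # odd, so a binary majority vote cannot tie
learners = []
for _ in range(n_learners):
    # Bootstrap: draw len(X) row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    learners.append(tree.fit(X[idx], y[idx]))

# Each learner votes; the majority class wins.
votes = np.array([t.predict(X) for t in learners])  # shape (n_learners, n_samples)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("bagged training accuracy:", (bagged_pred == y).mean())
```

For a regression problem the last step would average the learners' numeric predictions instead of taking a vote.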

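Boosting

Boosting is, in essence, a way to promote weak learners into a strong learner. It differs from bagging in that bagging treats all weak learners equally, whereas boosting treats them differently by changing their weights. Specifically:
1. After each round of training, weak learners that predict more accurately receive a larger weight, and poorly performing ones receive a smaller weight.
2. After each round of training, the weights (or probability distribution) of the training samples change: samples the previous weak learner got wrong are weighted up, and samples it got right are weighted down, so the next weak learner pays more attention to the data that was predicted incorrectly.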
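
To make the two kinds of weights concrete, here is a rough sketch of a single round of the classic AdaBoost update for labels in {-1, +1}. The labels and predictions are made-up values, and other boosting models mentioned above (GBDT, XGBoost, LightGBM) use different update rules.

```python
import numpy as np

# Hypothetical outcome of one round: true labels and a weak learner's predictions.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])  # the sample at index 1 is misclassified

w = np.full(len(y_true), 1 / len(y_true))  # start with uniform sample weights

err = w[y_pred != y_true].sum()        # weighted error rate of this learner
alpha = 0.5 * np.log((1 - err) / err)  # learner weight: larger when err is small

# Raise the weights of misclassified samples, lower the rest, then renormalize.
w = w * np.exp(-alpha * y_true * y_pred)
w = w / w.sum()

print("learner weight:", alpha)
print("new sample weights:", w)
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.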

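The random forest model

1. Basic principles

Random forest is a classic bagging model whose weak learners are decision trees. To keep the model's ability to generalize, each tree is built according to two basic principles: random data and random features.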

1. Random data

The training data for each decision tree is drawn from the full data set by random sampling with replacement.
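
A small sketch of what sampling with replacement implies, assuming a data set of 10,000 rows (an arbitrary size chosen for the illustration): when a bootstrap sample is as large as the original data, only about 63% of the distinct rows appear in it, so every tree sees a somewhat different data set.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000  # hypothetical data set size

# One bootstrap sample: n_samples row indices drawn with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

unique_fraction = np.unique(bootstrap_idx).size / n_samples
print("fraction of distinct rows in the bootstrap sample:", unique_fraction)
```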

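2. Random features

Suppose each sample has M features. Choose a constant k < M and randomly select k of the M features as candidates. In Python, scikit-learn's RandomForestClassifier uses k = √M by default; RandomForestRegressor, by contrast, considers all M features by default.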
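
The selection step can be sketched as follows, assuming M = 16 purely for illustration. Note that scikit-learn draws such a subset again at every split of a tree (controlled by the `max_features` parameter) rather than once per tree; the sketch shows a single draw.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
M = 16                 # total number of features (hypothetical)
k = int(math.sqrt(M))  # k = sqrt(M) = 4

# Draw k distinct feature indices without replacement.
candidate_features = rng.choice(M, size=k, replace=False)
print(sorted(candidate_features.tolist()))
```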

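Implementing random forests in code

The weak learners of a random forest classifier are classification decision trees, and the weak learners of a random forest regressor are regression decision trees.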

    # Random forest binary (0/1) classification model
    from sklearn.ensemble import RandomForestClassifier

    X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
    y = [0, 0, 0, 1, 1]

    model = RandomForestClassifier(n_estimators=10, random_state=123)
    model.fit(X, y)
    print(model.predict([[5, 5]]))

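n_estimators: the number of weak learners (trees) in the forest
random_state: the random seed; fixing it makes the result reproducible from run to run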

    # Random forest regression model
    from sklearn.ensemble import RandomForestRegressor

    X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
    y = [1, 2, 3, 4, 5]

    model = RandomForestRegressor(n_estimators=10, random_state=123)
    model.fit(X, y)
    print(model.predict([[5, 5]]))
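
To connect the regression model back to bagging, the sketch below reuses the same toy data and checks that the forest's prediction is the average of its individual trees' predictions. The fitted trees are available through the `estimators_` attribute; the variable names `query` and `tree_preds` are introduced only for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [1, 2, 3, 4, 5]
model = RandomForestRegressor(n_estimators=10, random_state=123)
model.fit(X, y)

query = np.array([[5, 5]])
# estimators_ holds the fitted decision trees that make up the forest.
tree_preds = np.array([tree.predict(query)[0] for tree in model.estimators_])

print("per-tree predictions:", tree_preds)
print("mean of trees:", tree_preds.mean())
print("forest prediction:", model.predict(query)[0])
```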

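Case study

Quantitative finance: stocks

1. Import the required libraries

    import tushare as ts  # basic stock data
    import numpy as np  # numerical computing
    import pandas as pd  # data analysis
    import talib  # technical indicators (derived stock variables)
    import matplotlib.pyplot as plt  # plotting
    from sklearn.ensemble import RandomForestClassifier  # random forest classifier
    from sklearn.metrics import accuracy_score  # accuracy scoring function
    import warnings
    warnings.filterwarnings("ignore")  # suppress warnings; they are not errors and do not stop execution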

2. Get the data

    # 1. Fetch the basic stock data
    df = ts.get_k_data('000002', start='2015-01-01', end='2019-12-31')
    df = df.set_index('date')  # use the date as the index

    # 2. Build simple derived variables
    df['close-open'] = (df['close'] - df['open']) / df['open']
    df['high-low'] = (df['high'] - df['low']) / df['low']
    df['pre_close'] = df['close'].shift(1)  # shift the column down one row to get yesterday's close
    df['price_change'] = df['close'] - df['pre_close']
    df['p_change'] = (df['close'] - df['pre_close']) / df['pre_close'] * 100

    # 3. Build moving-average data
    df['MA5'] = df['close'].rolling(5).mean()
    df['MA10'] = df['close'].rolling(10).mean()
    df.dropna(inplace=True)  # drop rows with missing values

    # 4. Build derived variables with TA-Lib
    df['RSI'] = talib.RSI(df['close'], timeperiod=12)  # relative strength index
    df['MOM'] = talib.MOM(df['close'], timeperiod=5)  # momentum
    df['EMA12'] = talib.EMA(df['close'], timeperiod=12)  # 12-day exponential moving average
    df['EMA26'] = talib.EMA(df['close'], timeperiod=26)  # 26-day exponential moving average
    df['MACD'], df['MACDsignal'], df['MACDhist'] = talib.MACD(
        df['close'], fastperiod=12, slowperiod=26, signalperiod=9)  # MACD values
    df.dropna(inplace=True)  # drop rows with missing values
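
The `shift` and `rolling` calls are easier to follow on a tiny table. The sketch below uses six made-up closing prices and a 3-day window (both assumptions for the example, not values from the case study) to show why the first rows of each derived column are empty and have to be dropped.

```python
import pandas as pd

# Hypothetical closing prices for six days.
df = pd.DataFrame(
    {"close": [10.0, 10.5, 10.2, 10.8, 11.0, 10.9]},
    index=pd.date_range("2019-01-01", periods=6, freq="D"),
)

df["pre_close"] = df["close"].shift(1)              # yesterday's close; NaN on the first row
df["price_change"] = df["close"] - df["pre_close"]  # absolute daily change
df["MA3"] = df["close"].rolling(3).mean()           # 3-day moving average; NaN on the first two rows

print(df)
```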

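3. Extract the feature variables and the target variable

    X = df[['close', 'volume', 'close-open', 'MA5', 'MA10', 'high-low', 'RSI',
            'MOM', 'EMA12', 'MACD', 'MACDsignal', 'MACDhist']]
    # Target: 1 if the next day's price rises, otherwise -1
    y = np.where(df['price_change'].shift(-1) > 0, 1, -1)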
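
The target construction can be checked on a few made-up price changes. One detail worth seeing: `shift(-1)` leaves the last row with no next-day value, and because a comparison with NaN is false, that row is labeled -1.

```python
import numpy as np
import pandas as pd

price_change = pd.Series([0.5, -0.3, 0.6, 0.2, -0.1])  # hypothetical daily changes

# shift(-1) pulls tomorrow's change onto today's row; the last row becomes NaN.
next_change = price_change.shift(-1)
y = np.where(next_change > 0, 1, -1)

print(y)  # [-1  1  1 -1 -1]
```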

4. Split into training and test sets

    X_length = X.shape[0]  # shape gives (rows, columns); shape[0] is the number of rows
    split = int(X_length * 0.9)
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

5. Build the model

    model = RandomForestClassifier(max_depth=3, n_estimators=10,
                                   min_samples_leaf=10, random_state=1)
    model.fit(X_train, y_train)

The full set of RandomForestClassifier parameters, as printed by the scikit-learn version used here (newer releases no longer accept min_impurity_split or max_features='auto'):

    RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                           criterion='gini', max_depth=3, max_features='auto',
                           max_leaf_nodes=None, max_samples=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=10, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=10,
                           n_jobs=None, oob_score=False, random_state=1, verbose=0,
                           warm_start=False)

6. Predict

    y_pred = model.predict(X_test)
    a = pd.DataFrame()  # create an empty DataFrame
    a['predicted'] = list(y_pred)
    a['actual'] = list(y_test)

The predict_proba() function can be used to predict the probability of each class.

    # Inspect the predicted probabilities
    y_pred_proba = model.predict_proba(X_test)
    y_pred_proba[0:5]

7. Evaluate model accuracy

    from sklearn.metrics import accuracy_score

    score = accuracy_score(y_test, y_pred)  # arguments are (y_true, y_pred)

    # The model's built-in score() method can also be used to score it:
    model.score(X_test, y_test)
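
Accuracy is simply the share of predictions that match the true labels. A quick check on made-up labels (not results from the stock data) shows that a manual count agrees with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_test = np.array([1, -1, 1, 1, -1])  # hypothetical true labels
y_pred = np.array([1, 1, 1, -1, -1])  # hypothetical predictions

manual = (y_pred == y_test).mean()  # 3 of 5 predictions are correct
print(manual, accuracy_score(y_test, y_pred))
```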

8. Analyze the importance of each feature variable

    model.feature_importances_

    # The following code presents the features and their importances more clearly:
    features = X.columns
    importances = model.feature_importances_
    a = pd.DataFrame()
    a['feature'] = features
    a['importance'] = importances
    a = a.sort_values('importance', ascending=False)
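
Since the stock data requires a download, here is a self-contained sketch of the same idea on synthetic data. The data set, the feature names `f0` to `f4`, and the forest size are all assumptions for the example. It also shows that the importances are normalized to sum to 1, so each value reads as a share of the total.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical): 5 features, only 2 of them informative.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])

model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
print("sum:", importances.sum())
```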

Parameter tuning

    from sklearn.model_selection import GridSearchCV  # grid search for suitable hyperparameters

    # Candidate values for each classifier parameter
    parameters = {'n_estimators': [5, 10, 20],
                  'max_depth': [2, 3, 4, 5],
                  'min_samples_leaf': [5, 10, 20, 30]}
    new_model = RandomForestClassifier(random_state=1)  # build the classifier

    # cv=6 means 6-fold cross-validation. scoring='accuracy' ranks candidates by
    # accuracy (the classifier's default criterion); setting scoring='roc_auc'
    # would rank them by the area under the ROC curve instead.
    grid_search = GridSearchCV(new_model, parameters, cv=6, scoring='accuracy')
    grid_search.fit(X_train, y_train)  # pass in the data
    grid_search.best_params_  # the best parameter values found
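
Because the rows in this case study are ordered in time, one option worth considering is scikit-learn's `TimeSeriesSplit`, which always validates on rows that come after the training rows. This is a variation on the grid search above, not what the original code does; the synthetic data and the reduced grid are assumptions to keep the sketch fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for the training data (hypothetical).
X_train, y_train = make_classification(n_samples=300, n_features=6, random_state=0)

parameters = {"n_estimators": [5, 10], "max_depth": [2, 3]}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    parameters,
    cv=TimeSeriesSplit(n_splits=4),  # each fold validates on rows after its training rows
    scoring="accuracy",
)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)

# With the default refit=True, best_estimator_ is already refitted on all of X_train.
best_model = grid_search.best_estimator_
```

The refitted `best_model` can then be used for prediction in place of a model rebuilt by hand from `best_params_`.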

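9. Plot the backtest return curves

    X_test = X_test.copy()  # work on a copy so that adding columns does not touch a slice of X
    X_test['prediction'] = model.predict(X_test)
    X_test['p_change'] = (X_test['close'] - X_test['close'].shift(1)) / X_test['close'].shift(1)
    X_test['origin'] = (X_test['p_change'] + 1).cumprod()  # cumulative return of holding the stock
    X_test['strategy'] = (X_test['prediction'].shift(1) * X_test['p_change'] + 1).cumprod()  # cumulative return of the strategy
    X_test[['strategy', 'origin']].tail()

    # Drop missing values, plot both return curves, and let the x-axis labels tilt automatically:
    X_test[['strategy', 'origin']].dropna().plot()
    plt.gcf().autofmt_xdate()
    plt.show()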
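
The return arithmetic can be traced on four made-up days. In this sketch (prices, signals, and the name `bt` are hypothetical) a signal of 1 means holding the stock the next day and -1 means holding the opposite position; trading costs are ignored, as in the code above.

```python
import pandas as pd

bt = pd.DataFrame({
    "close":      [100.0, 102.0, 101.0, 103.0],  # hypothetical closing prices
    "prediction": [1, -1, 1, 1],                 # signal produced at each day's close
})

bt["p_change"] = bt["close"].pct_change()      # daily return
bt["origin"] = (bt["p_change"] + 1).cumprod()  # buy-and-hold value
# Yesterday's signal decides today's position: +1 long, -1 short.
bt["strategy"] = (bt["prediction"].shift(1) * bt["p_change"] + 1).cumprod()

print(bt)
```

The `shift(1)` matters: a prediction made at today's close can only earn tomorrow's return, so multiplying today's signal by today's return would use information that was not yet available.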
