當前位置：首頁 > 编程语言 > python >内容正文

python

【机器学习基础】数学推导+纯Python实现机器学习算法15：GBDT

發(fā)布時間：2025/3/8 python 24 豆豆

生活随笔收集整理的這篇文章主要介紹了【机器学习基础】数学推导+纯Python实现机器学习算法15：GBDT 小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

Python機器學習算法實現(xiàn)

Author：louwill

Machine Learning Lab

? ? ?

???? 時隔大半年，機器學習算法推導系列終于有時間繼續(xù)更新了。在之前的14講中，筆者將監(jiān)督模型中主要的單模型算法基本都過了一遍。預計在接下來的10講中，筆者將努力更新完以GBDT代表的集成學習模型，以EM算法、CRF和隱馬為代表的概率圖模型以及以聚類降維為代表的無監(jiān)督學習算法。

???? 在系列第4和第5講，筆者集中對ID3和CART決策樹算法進行了闡述，并給出二者算法的一些初步實現(xiàn)。本節(jié)我們來看集成學習的核心模型GBDT(Gradient Boosting Decision Tree)，即梯度提升決策樹，這也是一種決策樹模型算法。GBDT近年來在一些數(shù)據(jù)競賽上大殺四方，并不斷衍生出像XGBoost和LightGBM等更強大的版本。從名字上看，GBDT是由決策樹、提升模型和梯度下降一起構(gòu)成的。所以，要搞清楚GBDT的基本原理，就必須對這三者及其相互作用有一個深入的理解。

GBDT基本原理

???? 決策樹的基本原理我們已經(jīng)很清楚了，就是依據(jù)信息增益等原則不斷選擇特征構(gòu)建樹模型的過程，具體可參考數(shù)學推導+純Python實現(xiàn)機器學習算法5：決策樹之CART算法。Boosting則是一種集成學習模式，通過將多個單個決策樹(弱學習器)進行線性組合構(gòu)成一個強學習器的過程，Boosting以一個單模型作為作為弱分類器，GBDT中使用CART作為這種弱學習器(基模型)。而融入了梯度下降對Boosting樹模型進行優(yōu)化之后就有了梯度提升樹模型。

???? 我們先來用一個通俗的說法來理解GBDT。假設(shè)某位同學月薪10k，筆者先用一個樹模型擬合了6k，發(fā)現(xiàn)有4k的損失，然后再用一棵樹模型擬合了2k，這樣持續(xù)擬合下去，擬合值和目標值之間的殘差會越來越小，而我們將每一輪迭代，也就是每一棵樹的預測值加起來就是模型最終的預測結(jié)果。不停的使用單棵決策樹組合就是Boosting的過程，使用梯度下降對Boosting樹模型進行優(yōu)化的過程就是Gradient Boosting。

???? 下面我們用數(shù)學語言來描述GBDT。

???? 一個提升樹模型可以描述為：

???? 在給定初始模型的情況下，第m步的模型可以表示為：

???? 然后我們通過如下目標函數(shù)來優(yōu)化下一棵樹的參數(shù)：

???? 以回歸問題的提升樹為例展開，一棵回歸樹可表示為：

???? 第0步、第m步和最終模型可表示為：

???? 給定第m-1步的模型下，求解：

???? 當損失函數(shù)為平方損失時：

???? 相應(yīng)的損失可推導為：

???? 則有：

???? 說明提升樹模型每一次迭代是在擬合一個殘差函數(shù)。

???? 但實際工作中并不是每一個損失函數(shù)都如平方損失那樣容易優(yōu)化，所以有學者就提出近似梯度下降的方法來使用損失函數(shù)的負梯度在當前模型的值作為回歸提升樹中殘差的近似值，即：

???? 所以，綜合提升樹和梯度提升，GBDT模型算法的一般流程可歸納為：

(1) 初始化弱學習器:

(2) 對有：

對每個樣本，計算負梯度，即殘差

將上步得到的殘差作為樣本新的真實值，并將數(shù)據(jù)作為下棵樹的訓練數(shù)據(jù)，得到一顆新的回歸樹其對應(yīng)的葉子節(jié)點區(qū)域為。其中為回歸樹t的葉子節(jié)點的個數(shù)。
對葉子區(qū)域計算最佳擬合值

更新強學習器

(3) 得到最終學習器

GBDT代碼框架

???? 手動從頭開始寫一個GBDT模型并非易事，需要我們對GBDT模型算法細節(jié)都有足夠深入的理解。在動手寫代碼之前，我們需要梳理清楚代碼框架，一個完整的GBDT系統(tǒng)應(yīng)包括如下幾個方面，如圖所示。

???? GBDT的基模型為CART，所以定義決策樹結(jié)點和構(gòu)建CART樹至關(guān)重要，CART算法筆者系列第5講已經(jīng)進行了初步實現(xiàn)。當基模型構(gòu)建好后，即可根據(jù)GBDT算法流程搭建GBDT和GBRT。除此之外，一些輔助函數(shù)的定義(最大熵/Gini指數(shù)計算)，損失函數(shù)定義和模型可視化方法等輔助功能也應(yīng)該一應(yīng)俱全。

???? 因樹結(jié)點和CART樹模型第5講已講過，具體實現(xiàn)方法這里不再重寫。

???? 結(jié)點定義代碼框架：

class?TreeNode():def __init__(self, feature_i=None, threshold=None,value=None, true_branch=None, false_branch=None):pass

???? 樹定義代碼框架，主要包括樹的基本屬性和方法。基本屬性包括根結(jié)點、最小劃分樣本數(shù)、最大深度和是否為葉子結(jié)點等等。基本方法包括決策樹構(gòu)建、決策樹擬合、決策樹預測和打印等方法。

class?Tree(object):def __init__(self, min_samples_split=2, min_impurity=1e-7,max_depth=float("inf"), loss=None):self.root = None # Root node in dec. tree# Minimum n of samples to justify splitself.min_samples_split = min_samples_split# The minimum impurity to justify splitself.min_impurity = min_impurity# The maximum depth to grow the tree toself.max_depth = max_depth# Function to calculate impurity (classif.=>info gain, regr=>variance reduct.)# 切割樹的方法，gini，方差等self._impurity_calculation = None# Function to determine prediction of y at leaf# 樹節(jié)點取值的方法，分類樹：選取出現(xiàn)最多次數(shù)的值，回歸樹：取所有值的平均值self._leaf_value_calculation = None# If y is one-hot encoded (multi-dim) or not (one-dim)self.one_dim = None# If Gradient Boostself.loss = lossdef fit(self, X, y, loss=None):"""?Build?decision?tree?"""passdef _build_tree(self, X, y, current_depth=0):""" Recursive method which builds out the decision tree and splits X and respective ypassdef predict_value(self, x, tree=None):""" Do a recursive search down the tree and make a prediction of the data sample by thevalue of the leaf that we end up at """passdef predict(self, X):""" Classify samples one by one and return the set of labels """passdef?print_tree(self,?tree=None,?indent="?"):pass

???? 以回歸樹為例，基于以上樹模型，可定義回歸樹模型如下：

class RegressionTree(Tree):# 使用方差法進行樹分割def _calculate_variance_reduction(self, y, y1, y2):var_tot = calculate_variance(y)var_1 = calculate_variance(y1)var_2 = calculate_variance(y2)frac_1 = len(y1) / len(y)frac_2 = len(y2) / len(y)# Calculate the variance reductionvariance_reduction = var_tot - (frac_1 * var_1 + frac_2 * var_2)return sum(variance_reduction)# 使用均值法取葉子結(jié)點值def _mean_of_y(self, y):value = np.mean(y, axis=0)return value if len(value) > 1 else value[0]# 回歸樹擬合def fit(self, X, y):self._impurity_calculation = self._calculate_variance_reductionself._leaf_value_calculation = self._mean_of_ysuper(RegressionTree, self).fit(X, y)

???? 在定義GBRT之前，先定義損失均方誤差損失函數(shù)：

class Loss(object):def loss(self, y_true, y_pred):return NotImplementedError()def gradient(self, y, y_pred):raise NotImplementedError()def acc(self, y, y_pred):return 0class SquareLoss(Loss):def __init__(self): passdef loss(self, y, y_pred):return 0.5 * np.power((y - y_pred), 2)def gradient(self, y, y_pred):return -(y - y_pred)

???? 然后定義初始版本的GBDT模型：

class GBDT(object):def __init__(self, n_estimators, learning_rate, min_samples_split,min_impurity, max_depth, regression):# 基本參數(shù)self.n_estimators = n_estimatorsself.learning_rate = learning_rateself.min_samples_split = min_samples_splitself.min_impurity = min_impurityself.max_depth = max_depthself.regression = regressionself.loss = SquareLoss()if not self.regression:self.loss = SotfMaxLoss()# 分類問題也可以使用回歸樹，利用殘差去學習概率self.estimators = []for i in range(self.n_estimators):self.estimators.append(RegressionTree(min_samples_split=self.min_samples_split,min_impurity=self.min_impurity,max_depth=self.max_depth))# 擬合方法def fit(self, X, y):# 讓第一棵樹去擬合模型self.estimators[0].fit(X, y)y_pred = self.estimators[0].predict(X)for i in range(1, self.n_estimators):gradient = self.loss.gradient(y, y_pred)self.estimators[i].fit(X, gradient)y_pred -= np.multiply(self.learning_rate, self.estimators[i].predict(X))# 預測方法def predict(self, X):y_pred = self.estimators[0].predict(X)for i in range(1, self.n_estimators):y_pred -= np.multiply(self.learning_rate, self.estimators[i].predict(X))if not self.regression:# Turn into probability distributiony_pred = np.exp(y_pred) / np.expand_dims(np.sum(np.exp(y_pred), axis=1), axis=1)# Set label to the value that maximizes probabilityy_pred = np.argmax(y_pred, axis=1)return y_pred

?????

???? 然后可分別定義GBDT和GBRT：

# regression tree class GBDTRegressor(GBDT):def __init__(self, n_estimators=200, learning_rate=0.5, min_samples_split=2,min_var_red=1e-7, max_depth=4, debug=False):super(GBDTRegressor, self).__init__(n_estimators=n_estimators,learning_rate=learning_rate,min_samples_split=min_samples_split,min_impurity=min_var_red,max_depth=max_depth,regression=True) #?classification?tree class GBDTClassifier(GBDT):def?__init__(self,?n_estimators=200,?learning_rate=.5,?min_samples_split=2,min_info_gain=1e-7, max_depth=2, debug=False):super(GBDTClassifier,?self).__init__(n_estimators=n_estimators,learning_rate=learning_rate,min_samples_split=min_samples_split,min_impurity=min_info_gain,max_depth=max_depth,regression=False)def fit(self, X, y):y = to_categorical(y)super(GBDTClassifier, self).fit(X, y)

?????

???? 最后基于boston房價數(shù)據(jù)集給出一個計算例子：

from?sklearn?import?datasets boston = datasets.load_boston() X, y = shuffle_data(boston.data, boston.target, seed=13) X = X.astype(np.float32) offset = int(X.shape[0] * 0.9)X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)model = GBDTRegressor() model.fit(X_train, y_train) y_pred = model.predict(X_test) # Color map cmap?=?plt.get_cmap('viridis') mse?=?mean_squared_error(y_test,?y_pred) print ("Mean Squared Error:", mse)# Plot the results m1 = plt.scatter(range(X_test.shape[0]), y_test, color=cmap(0.5), s=10) m2 = plt.scatter(range(X_test.shape[0]), y_pred, color='black', s=10) plt.suptitle("Regression?Tree") plt.title("MSE:?%.2f"?%?mse,?fontsize=10) plt.xlabel('sample') plt.ylabel('house price') plt.legend((m1, m2), ("Test data", "Prediction"), loc='lower right') plt.show();

???? slearn中為我們提供了GBDT算法完整的API可供調(diào)用，實際工程中更不可能自己手寫這么復雜的算法系統(tǒng)。但作為學習，手寫算法不失為一種深入理解算法細節(jié)和鍛煉代碼能力的好方法。

完整代碼可參考：

https://github.com/RRdmlearning/Machine-Learning-From-Scratch/blob/master/gradient_boosting_decision_tree

參考資料：

https://github.com/RRdmlearning/Machine-Learning-From-Scratch/blob/master/gradient_boosting_decision_tree

李航統(tǒng)計學習方法

往期精彩：

數(shù)學推導+純Python實現(xiàn)機器學習算法14：Ridge嶺回歸

數(shù)學推導+純Python實現(xiàn)機器學習算法13：Lasso回歸

數(shù)學推導+純Python實現(xiàn)機器學習算法12：貝葉斯網(wǎng)絡(luò)

數(shù)學推導+純Python實現(xiàn)機器學習算法11：樸素貝葉斯

數(shù)學推導+純Python實現(xiàn)機器學習算法10：線性不可分支持向量機

數(shù)學推導+純Python實現(xiàn)機器學習算法8-9：線性可分支持向量機和線性支持向量機

數(shù)學推導+純Python實現(xiàn)機器學習算法7：神經(jīng)網(wǎng)絡(luò)

數(shù)學推導+純Python實現(xiàn)機器學習算法6：感知機

數(shù)學推導+純Python實現(xiàn)機器學習算法5：決策樹之CART算法

數(shù)學推導+純Python實現(xiàn)機器學習算法4：決策樹之ID3算法

數(shù)學推導+純Python實現(xiàn)機器學習算法3：k近鄰

數(shù)學推導+純Python實現(xiàn)機器學習算法2：邏輯回歸

數(shù)學推導+純Python實現(xiàn)機器學習算法1：線性回歸

往期精彩回顧適合初學者入門人工智能的路線及資料下載機器學習及深度學習筆記等資料打印機器學習在線手冊深度學習筆記專輯《統(tǒng)計學習方法》的代碼復現(xiàn)專輯 AI基礎(chǔ)下載機器學習的數(shù)學基礎(chǔ)專輯獲取一折本站知識星球優(yōu)惠券，復制鏈接直接打開：https://t.zsxq.com/yFQV7am本站qq群1003271085。加入微信群請掃碼進群：

總結(jié)

以上是生活随笔為你收集整理的【机器学习基础】数学推导+纯Python实现机器学习算法15：GBDT的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇： CVPR 2020：如何写一篇好论文？
下一篇： websocket python爬虫_p