Classification and Regression Decision Trees: The Maths Behind the Decision Tree Classifier

Maths behind Decision Tree Classifier
Before we get to the Python implementation of the decision tree, let's first understand the math behind decision tree classification. We will see how all the above-mentioned terms are used for splitting.
We will use a simple dataset that contains information about students from different classes and genders, and see whether they stay in the school's hostel or not.
This is what our dataset looks like:
Let's try to understand how the root node is selected by calculating Gini impurity. We will use the above-mentioned data.
We have two features which we can use for nodes: "Class" and "Gender". We will calculate the Gini impurity for each feature and then select the feature with the least Gini impurity.
Let's review the formula for calculating Gini impurity:
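The formula image from the original post does not carry over here; for reference, the standard Gini impurity of a node whose samples fall into classes with proportions $p_i$ is:

$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^{2}$$

where $C$ is the number of classes. A pure node (all samples in one class) has Gini impurity 0, and for two classes the maximum possible value is 0.5.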
Let's start with "Class"; we will calculate the Gini impurity for each of the different values in "Class".
This is how our decision tree node is selected, by calculating the Gini impurity for each node individually. If the number of features increases, we just need to repeat the same steps after the selection of the root node.
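Since the worked calculation above relies on images from the original post, here is a minimal Python sketch of the same idea. The small `data` list below is hypothetical and only stands in for the student table described earlier.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, feature, target):
    """Weighted Gini impurity after splitting `rows` on `feature`."""
    n = len(rows)
    total = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == value]
        total += len(subset) / n * gini(subset)
    return total

# Hypothetical stand-in for the student dataset (Class, Gender, Hostel).
data = [
    {"Class": "IX", "Gender": "M", "Hostel": "Yes"},
    {"Class": "IX", "Gender": "F", "Hostel": "No"},
    {"Class": "X",  "Gender": "M", "Hostel": "Yes"},
    {"Class": "X",  "Gender": "F", "Hostel": "Yes"},
    {"Class": "XI", "Gender": "M", "Hostel": "No"},
    {"Class": "XI", "Gender": "F", "Hostel": "No"},
]

# The feature with the lowest weighted Gini impurity becomes the root node.
for feature in ("Class", "Gender"):
    print(feature, round(weighted_gini(data, feature, "Hostel"), 3))
```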
We will now try to find the root node for the same dataset by calculating entropy and information gain.
DataSet:
We have two features, and we will try to choose the root node by calculating the information gain obtained from splitting on each feature.
Let's review the formulas for entropy and information gain:
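Again, the original formula image is not reproduced here; the standard definitions are:

$$\text{Entropy}(S) = -\sum_{i=1}^{C} p_i \log_2 p_i$$

$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$

where $S$ is the set of samples at the node, $p_i$ is the proportion of samples in class $i$, and $S_v$ is the subset of $S$ for which attribute $A$ takes the value $v$. The feature with the highest information gain is chosen for the split.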
Let's start with the feature "Class":
Let's see the information gain from the feature "Gender":
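As with the Gini example above, the step-by-step numbers live in the original images, so here is only a sketch of how the same information-gain comparison could be coded, reusing the same hypothetical student table.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """Entropy of the parent minus the weighted entropy of the children."""
    parent = entropy([row[target] for row in rows])
    n = len(rows)
    children = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == value]
        children += len(subset) / n * entropy(subset)
    return parent - children

# Same hypothetical student table as in the Gini sketch above.
data = [
    {"Class": "IX", "Gender": "M", "Hostel": "Yes"},
    {"Class": "IX", "Gender": "F", "Hostel": "No"},
    {"Class": "X",  "Gender": "M", "Hostel": "Yes"},
    {"Class": "X",  "Gender": "F", "Hostel": "Yes"},
    {"Class": "XI", "Gender": "M", "Hostel": "No"},
    {"Class": "XI", "Gender": "F", "Hostel": "No"},
]

# The feature with the highest information gain becomes the root node.
for feature in ("Class", "Gender"):
    print(feature, round(information_gain(data, feature, "Hostel"), 3))
```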
Different Algorithms for Decision Tree
- ID3 (Iterative Dichotomiser): It is one of the algorithms used to construct a decision tree for classification. It uses information gain as the criterion for finding the root nodes and splitting them. It only accepts categorical attributes.
- C4.5: It is an extension of the ID3 algorithm and better than ID3, as it handles both continuous and discrete values. It is also used for classification purposes.
- Classification and Regression Tree (CART): It is the most popular algorithm used for constructing decision trees. It uses Gini impurity as the default criterion for selecting root nodes, although one can use entropy as the criterion as well. This algorithm works on both regression and classification problems. We will use this algorithm in our Python implementation (a minimal sketch follows this list).
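The original article's full implementation is not included in this section, so the following is only a minimal sketch of how a CART-style classifier could be trained with scikit-learn on the same kind of student data. The `criterion` values are standard scikit-learn options, while the tiny encoded dataset is hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoding of the student table: Class and Gender as integers.
df = pd.DataFrame({
    "Class":  [9, 9, 10, 10, 11, 11],
    "Gender": [0, 1, 0, 1, 0, 1],   # 0 = male, 1 = female
    "Hostel": [1, 0, 1, 1, 0, 0],   # 1 = stays in the hostel
})

X, y = df[["Class", "Gender"]], df["Hostel"]

# CART with the default Gini criterion; pass criterion="entropy" to split
# on information gain instead.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

# Predict for a hypothetical class-10 female student.
print(clf.predict(pd.DataFrame({"Class": [10], "Gender": [1]})))
```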
Entropy and Gini impurity can be used interchangeably; it doesn't affect the result much. However, Gini is easier to compute than entropy, since entropy involves a logarithm. That's why the CART algorithm uses Gini as the default criterion.
If we plot Gini versus entropy, we can see there is not much difference between them:
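The comparison plot from the original post is not included here, but a similar graph can be produced with a few lines of matplotlib. This sketch plots both measures for a two-class node as a function of the probability p of the first class; entropy is also shown scaled by 0.5, a common trick to put the two curves on the same range.

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)   # probability of class 1
gini = 2 * p * (1 - p)               # 1 - p^2 - (1 - p)^2
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

plt.plot(p, gini, label="Gini impurity")
plt.plot(p, entropy, label="Entropy")
plt.plot(p, entropy / 2, "--", label="Entropy / 2")
plt.xlabel("p (proportion of class 1)")
plt.ylabel("Impurity")
plt.legend()
plt.show()
```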
Advantages of Decision Trees:
- It can be used for both Regression and Classification problems.
- Decision Trees are very easy to grasp, as the rules for splitting are clearly stated.
- Complex decision tree models are very simple when visualized; they can be understood just by looking at the tree.
- Scaling and normalization are not needed.
Disadvantages of Decision Trees:
- A small change in data can cause instability in the model because of the greedy approach.
- The probability of overfitting is very high for Decision Trees.
- It takes more time to train a decision tree model than other classification algorithms.
Translated from: https://medium.com/@er.amansingh2019/maths-behind-decision-tree-classifier-e3bfd5445540