Decision Trees: A Step-by-Step Approach to Building DTs
Introduction
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but are also a popular tool in machine learning.
Context
In this article, we will discuss what decision trees are and how to build them step by step using the CART and ID3 algorithms.
What are Decision Trees?
Fig.1 - Decision tree based on a yes/no question
The above picture is a simple decision tree. If a person is non-vegetarian, then he/she (most probably) eats chicken; otherwise, he/she doesn't eat chicken. A decision tree, in general, asks a question and classifies the person based on the answer. This decision tree is based on a yes/no question. It is just as simple to build a decision tree on numeric data.
Fig.2 - Decision tree based on numeric data
If a person is driving above 80 kmph, we can consider it over-speeding; otherwise not.
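To make the numeric split concrete, here is a minimal Python sketch of the threshold test that Fig.2 describes; the function name and the example speeds are ours, purely for illustration:

```python
# A single decision-tree node on numeric data is just a threshold test.
def is_over_speeding(speed_kmph: float) -> bool:
    """Consider anything above 80 kmph as over-speeding (as in Fig.2)."""
    return speed_kmph > 80

print(is_over_speeding(95))  # True  -> over-speeding
print(is_over_speeding(60))  # False -> not over-speeding
```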
Fig.3 - Decision tree on ranked data
Here is one more simple decision tree. This decision tree is based on ranked data, where rank 1 means the speed is too high and rank 2 corresponds to a much lower speed. If a person is speeding above rank 1, then he/she is highly over-speeding. If the person is above speed rank 2 but below speed rank 1, then he/she is over-speeding, but not by as much. If the person is below speed rank 2, then he/she is driving well within the speed limits.
The classification in a decision tree can be either categorical or numeric.
Fig.4 - Complex DT
Here's a more complicated decision tree. It combines numeric data with yes/no data. For the most part, decision trees are pretty simple to work with. You start at the top and work your way down until you reach a point where you can't go any further. That's how a sample is classified.
The very top of the tree is called the root node or just the root. The nodes in between are called internal nodes. Internal nodes have arrows pointing to them and arrows pointing away from them. The end nodes are called the leaf nodes or just leaves. Leaf nodes have arrows pointing to them but no arrows pointing away from them.
In the above diagrams, root nodes are represented by rectangles, internal nodes by circles, and leaf nodes by inverted-triangles.
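To tie the terminology together, here is a small illustrative sketch (not the exact tree from Fig.4) of a decision tree stored as nested dicts, where classifying a sample means starting at the root and following branches until a leaf is reached:

```python
# Illustrative only: a root node, one internal node, and three leaf nodes.
tree = {
    "question": lambda s: s["vegetarian"],      # root node
    "yes": "does not eat chicken",              # leaf node
    "no": {
        "question": lambda s: s["speed"] > 80,  # internal node (numeric test)
        "yes": "over-speeding",                 # leaf node
        "no": "within speed limits",            # leaf node
    },
}

def classify(node, sample):
    """Start at the root and follow yes/no branches until a leaf (a string) is reached."""
    while isinstance(node, dict):
        node = node["yes"] if node["question"](sample) else node["no"]
    return node

print(classify(tree, {"vegetarian": False, "speed": 95}))  # -> 'over-speeding'
```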
Building a Decision Tree
There are several algorithms to build a decision tree.
We will discuss only the CART and ID3 algorithms, as they are the ones most widely used.
CART
CART is a DT algorithm that produces binary Classification or Regression Trees, depending on whether the dependent (or target) variable is categorical or numeric, respectively. It handles data in its raw form (no preprocessing needed) and can use the same variables more than once in different parts of the same DT, which may uncover complex interdependencies between sets of variables.
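Before walking through the by-hand construction below, note that scikit-learn's DecisionTreeClassifier implements an optimized version of CART. A minimal sketch, using a tiny made-up 0/1 dataset rather than the article's data:

```python
# Minimal CART-style tree with scikit-learn; the data below is made up.
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: chest_pain, good_blood_circulation, blocked_arteries (0/1 flags)
X = [[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0], [1, 0, 0], [0, 1, 1]]
y = [1, 0, 1, 0, 1, 0]  # 1 = heart disease, 0 = no heart disease

clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)
print(export_text(clf, feature_names=["chest_pain", "good_circulation", "blocked_arteries"]))
```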
Fig.5 - Sample dataset
Now we are going to discuss how to build a decision tree from a raw table of data. In the example given above, we will be building a decision tree that uses chest pain, good blood circulation, and the status of blocked arteries to predict if a person has heart disease or not.
The first thing we have to figure out is which feature should be at the top of the tree, i.e., in the root node. We start by looking at how chest pain alone predicts heart disease.
Fig.6 - Chest pain as the root node
There are two leaf nodes, one for each of the two answers to chest pain. Each leaf contains the number of patients with heart disease and the number without heart disease for the corresponding chest-pain answer. Now we do the same thing for good blood circulation and blocked arteries.
Fig.7 - Good blood circulation as the root node
Fig.8 - Blocked arteries as the root node
We can see that none of the three features separates the patients with heart disease from the patients without heart disease perfectly. It is to be noted that the total number of patients with heart disease is different in all three cases. This is done to simulate the missing values present in real-world datasets.
Because none of the leaf nodes is either 100% ‘yes heart disease’ or 100% ‘no heart disease’, they are all considered impure. To decide on which separation is the best, we need a method to measure and compare impurity.
因?yàn)樗腥~節(jié)點(diǎn)都不是100%“是心臟病”或100%“沒有心臟病”,所以它們都被認(rèn)為是不純的。 為了確定哪種分離最好,我們需要一種測(cè)量和比較雜質(zhì)的方法。
The metric used in the CART algorithm to measure impurity is the Gini impurity score. Calculating Gini impurity is very easy. Let’s start by calculating the Gini impurity for chest pain.
Fig.9 - Chest pain separation
For the left leaf,
Gini impurity = 1 - (probability of 'yes')² - (probability of 'no')²
              = 1 - (105/(105+39))² - (39/(105+39))²
              = 0.395
Similarly, calculate the Gini impurity for the right leaf node.
Gini impurity = 1 - (probability of 'yes')² - (probability of 'no')²
              = 1 - (34/(34+125))² - (125/(34+125))²
              = 0.336
Now that we have measured the Gini impurity for both leaf nodes, we can calculate the total Gini impurity of using chest pain to separate patients with and without heart disease.
The leaf nodes do not represent the same number of patients: the left leaf represents 144 patients and the right leaf represents 159. Thus the total Gini impurity will be the weighted average of the leaf-node Gini impurities.
Total Gini impurity = (144/(144+159)) × 0.395 + (159/(144+159)) × 0.336 = 0.364
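The calculations above are easy to wrap in a couple of helper functions. A minimal sketch in Python (the function names are ours; the counts are the chest-pain counts from Fig.9):

```python
# Gini impurity of a leaf and weighted Gini impurity of a split.
def gini_impurity(yes: int, no: int) -> float:
    """1 - p(yes)^2 - p(no)^2 for a single leaf."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def weighted_gini(*leaves) -> float:
    """Total Gini impurity of a split, weighting each leaf by its size."""
    total = sum(yes + no for yes, no in leaves)
    return sum((yes + no) / total * gini_impurity(yes, no) for yes, no in leaves)

left = (105, 39)   # chest pain = yes: 105 with heart disease, 39 without
right = (34, 125)  # chest pain = no:  34 with heart disease, 125 without
print(round(gini_impurity(*left), 3))        # 0.395
print(round(gini_impurity(*right), 3))       # 0.336
print(round(weighted_gini(left, right), 3))  # 0.364
```

The same helpers apply to any candidate split; the root is simply the feature with the smallest total (weighted) Gini impurity.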
Similarly, the total Gini impurity for 'good blood circulation' and 'blocked arteries' is calculated as:
Gini impurity for 'good blood circulation' = 0.360
Gini impurity for 'blocked arteries' = 0.381
'Good blood circulation' has the lowest impurity score among the three, which means it best separates the patients with and without heart disease, so we will use it at the root node.
Fig.10 - Good blood circulation at the root node
Now we need to figure out how well 'chest pain' and 'blocked arteries' separate the 164 patients in the left node (37 with heart disease and 127 without heart disease).
Just like we did before, we will separate these patients by 'chest pain' and calculate the Gini impurity value.
Fig.11 - Chest pain separation
The Gini impurity was found to be 0.3. Then we do the same thing for 'blocked arteries'.
Fig.12 - Blocked arteries separation
The Gini impurity was found to be 0.29. Since 'blocked arteries' has the lowest Gini impurity, we will use it at the left node in Fig.10 for further separating the patients.
Fig.13 - Blocked arteries separation
All we have left is 'chest pain', so we will see how well it separates the 49 patients in the left node (24 with heart disease and 25 without heart disease).
Fig.14 - Chest pain separation in the left node
We can see that chest pain does a good job of separating the patients.
Fig.15 - Final chest pain separation
So these are the final leaf nodes on the left side of this branch of the tree. Now let's see what happens when we try to separate the node containing 13/102 patients using 'chest pain'. Note that almost 90% of the people in this node do not have heart disease.
Fig.16 - Chest pain separation on the right node
The Gini impurity of this separation is 0.29. But the Gini impurity of the parent node, before using chest pain to separate the patients, is
Gini impurity = 1 - (probability of 'yes')² - (probability of 'no')²
              = 1 - (13/(13+102))² - (102/(13+102))²
              = 0.2
The impurity is lower if we don't separate the patients using 'chest pain'. So we will make this node a leaf node.
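The stopping rule just described can be sketched in a few lines: keep a split only if the weighted Gini impurity of the children is lower than the parent node's own impurity. A minimal sketch, using the numbers from Fig.16:

```python
# Split a node only if doing so lowers impurity; otherwise keep it as a leaf.
def should_split(parent_gini: float, children_weighted_gini: float) -> bool:
    return children_weighted_gini < parent_gini

# Node with 13 heart-disease / 102 no-heart-disease patients (Fig.16).
parent_gini = 1 - (13 / 115) ** 2 - (102 / 115) ** 2
print(round(parent_gini, 2))            # 0.2
print(should_split(parent_gini, 0.29))  # False -> make it a leaf node
```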
Fig.17 - Left side completed
At this point, we have worked out the entire left side of the tree. The same steps are followed to work out the right side of the tree.
ID3
The process of building a decision tree using the ID3 algorithm is very similar to using the CART algorithm, except for the metric used to measure purity/impurity. The metric used in the ID3 algorithm to measure purity is called entropy.
Entropy is a way to measure the uncertainty of a class in a subset of examples. Assume an item x belongs to a subset S having two classes, positive and negative. Entropy is defined as the number of bits needed to say whether x is positive or negative.
Entropy always gives a number between 0 and 1. So if a subset formed after separating on an attribute is pure, then we will need zero bits to tell whether an item is positive or negative. If the subset formed has an equal number of positive and negative items, then the number of bits needed would be 1.
Fig.19 - Entropy vs. p(+)
The above plot shows the relation between entropy and p(+), the probability of the positive class. As we can see, entropy reaches 1, its maximum value, when there are equal chances for an item to be either positive or negative. The entropy is at its minimum when p(+) tends to 0 (meaning x is negative) or 1 (meaning x is positive).
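For two classes, the entropy of a node follows directly from the definition above. A minimal sketch (the counts in the last line are those of the 164-patient node used in the next part):

```python
# Binary entropy: E = -p(+) * log2(p(+)) - p(-) * log2(p(-)).
from math import log2

def entropy(pos: int, neg: int) -> float:
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:  # treat 0 * log2(0) as 0 so pure nodes get entropy 0
            p = count / total
            e -= p * log2(p)
    return e

print(entropy(50, 50))             # 1.0 -> maximum uncertainty
print(entropy(100, 0))             # 0.0 -> a pure node needs zero bits
print(round(entropy(37, 127), 2))  # ~0.77
```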
Entropy tells us how pure or impure each subset is after the split. What we need to do is aggregate these scores to check whether the split is worthwhile. This is done using information gain.
Fig.20 - Building an ID3 tree
Consider this part of the problem we discussed above for the CART algorithm. We need to decide which attribute to use, chest pain or blocked arteries, for separating the left node containing 164 patients (37 having heart disease and 127 not having heart disease). We can calculate the entropy before splitting as

Entropy = -(37/164) log₂(37/164) - (127/164) log₂(127/164) ≈ 0.77
Let's see how well chest pain separates the patients.
Fig.21 - Chest pain separation
The entropy for the left node can be calculated from its counts of patients with and without heart disease.
Similarly, the entropy for the right node can be calculated.
The total gain in entropy after splitting on chest pain is then the entropy before the split minus the weighted average of the two leaf entropies.
This implies that, in the current situation, if we were to pick chest pain for splitting the patients, we would gain 0.098 bits of certainty about whether a patient has heart disease. Doing the same for blocked arteries, the gain obtained was 0.117. Since splitting on blocked arteries gives us more certainty, it would be picked. We can repeat the same procedure for all the nodes to build a DT based on the ID3 algorithm.
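A sketch of the information-gain computation follows. Only the parent node (37 with heart disease, 127 without) and the reported gains (about 0.098 for chest pain and 0.117 for blocked arteries) come from the article; the child counts in the example call are made-up placeholders, since the exact per-leaf numbers from Fig.21 are not reproduced in the text:

```python
from math import log2

def entropy(pos: int, neg: int) -> float:
    """Binary entropy of a node given its positive/negative counts."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            e -= p * log2(p)
    return e

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = sum(p + n for p, n in children)
    weighted = sum((p + n) / total * entropy(p, n) for p, n in children)
    return entropy(*parent) - weighted

parent = (37, 127)                    # the 164-patient node from Fig.20
example_split = [(30, 60), (7, 67)]   # placeholder counts, not the article's
print(round(information_gain(parent, example_split), 3))
```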
Note: The decision of whether to split a node into two or to declare it a leaf node can be made by imposing a minimum threshold on the gain value required. If the acquired gain is above the threshold, we split the node; otherwise, we leave it as a leaf node.
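A tiny sketch of that stopping rule, with an arbitrary illustrative threshold:

```python
# Split only if the information gain clears a minimum threshold.
MIN_GAIN = 0.01  # illustrative value, chosen per problem; not from the article

def split_or_leaf(gain: float) -> str:
    return "split" if gain > MIN_GAIN else "leaf"

print(split_or_leaf(0.117))  # 'split' (e.g., the blocked-arteries gain above)
print(split_or_leaf(0.004))  # 'leaf'
```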
Summary
The following are the take-aways from this article: what decision trees are and the terminology used to describe them (root, internal, and leaf nodes); how the CART algorithm chooses, at each node, the split with the lowest Gini impurity; and how the ID3 algorithm does the same using entropy and information gain.
Translated from: https://towardsdatascience.com/decision-trees-a-step-by-step-approach-to-building-dts-58f8a3e82596