Decision Tree and Its Implementation


This article implements a Decision Tree step by step in Python, covering the following stages:

  • Loading the dataset
  • Computing the entropy
  • Splitting the data on the best splitting feature
  • Choosing the best splitting feature by maximum information gain
  • Building the decision tree recursively
  • Classifying samples

This article barely touches decision-tree theory; for details, search for the keywords "decision tree, information gain, entropy".

Each step is reflected directly in the code.

All of the code in this article lives in a single .py file.



1. Loading the Dataset

We use the classic UCI Iris dataset as the example.

Brief of the Iris dataset:

  Data Set Characteristics: Multivariate
  Number of Instances: 150
  Area: Life
  Attribute Characteristics: Real
  Number of Attributes: 4
  Date Donated: 1988-07-01
  Associated Tasks: Classification
  Missing Values: No
  Number of Web Hits: 533125


Code:

from numpy import *
#load "iris.data" into the workspace
traindata = loadtxt(r"D:\ZJU_Projects\machine learning\ML_Action\Dataset\Iris.data", delimiter=',', usecols=(0,1,2,3), dtype=float)
trainlabel = loadtxt(r"D:\ZJU_Projects\machine learning\ML_Action\Dataset\Iris.data", delimiter=',', usecols=(4,), dtype=str)
feaname = ["#0", "#1", "#2", "#3"]  # feature names of the 4 attributes (features)

Result:

(The original post shows a screenshot of the loaded data here: the left image is the raw dataset, with four real-valued features per sample and one label column giving the class; the three classes are Iris-setosa, Iris-versicolor, and Iris-virginica.)
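If the local Iris.data file is not at hand, the same arrays can be obtained from scikit-learn's bundled copy of the dataset. This is only a convenience sketch, not part of the original post; it assumes scikit-learn is installed, and note that its class names ('setosa', ...) differ from the 'Iris-setosa' style strings in the raw UCI file.

from sklearn.datasets import load_iris

iris = load_iris()
traindata = iris.data                        # 150 x 4 array of real-valued features
trainlabel = iris.target_names[iris.target]  # string labels: 'setosa', 'versicolor', 'virginica'
feaname = ["#0", "#1", "#2", "#3"]           # same feature names as above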




2. Computing the Entropy

Entropy was introduced by Shannon (a giant of information theory); see Wikipedia for the definition.

Note that the entropy computed here is H(C|X=xi), not H(C|X); H(C|X) is computed in step 4, where the per-subset entropies are additionally weighted by their probabilities and summed.
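For reference, if the classes occur in the given label set with fractions p_1, ..., p_k, the quantity computed by calentropy below is

    H = -\sum_i p_i \log_2 p_i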

    Code:

from math import log

def calentropy(label):
    n = label.size  # the number of samples
    count = {}  # dictionary counting occurrences of each class label
    for curlabel in label:
        if curlabel not in count.keys():
            count[curlabel] = 0
        count[curlabel] += 1
    entropy = 0
    for key in count:
        pxi = float(count[key]) / n  # notice: convert to float first
        entropy -= pxi * log(pxi, 2)
    return entropy

#testcode:
#x = calentropy(trainlabel)


Result:

(The original post shows a screenshot of the computed value here; for the full Iris label set, with 50 samples in each of the three classes, the entropy is log2(3) ≈ 1.585.)

3. Splitting the Data on the Best Splitting Feature

Assume we have already found the best splitting feature (its index is splitfea_idx); splitdata performs the split on it.

The second function, idx2data, takes the two index sets produced by splitdata and returns datal (samples less than the pivot), datag (samples greater than the pivot), labell, and labelg. We use the mean of the selected feature as the pivot.

#split the dataset according to feature "splitfea_idx"
def splitdata(oridata, splitfea_idx):
    arg = args[splitfea_idx]  # the pivot: the mean of this feature over all samples
    idx_less = []  # indices of samples whose feature value is less than the pivot
    idx_greater = []  # indices of samples whose feature value is greater than (or equal to) the pivot
    n = len(oridata)
    for idx in range(n):
        d = oridata[idx]
        if d[splitfea_idx] < arg:
            #add the entry to the "less" set
            idx_less.append(idx)
        else:
            idx_greater.append(idx)
    return idx_less, idx_greater

#testcode:
#idx_less, idx_greater = splitdata(traindata, 2)


#give the data and labels according to the index sets
def idx2data(oridata, label, splitidx, fea_idx):
    idxl = splitidx[0]  # indices of the "less" split
    idxg = splitidx[1]  # indices of the "greater" split
    datal = []
    datag = []
    labell = []
    labelg = []
    for i in idxl:
        datal.append(append(oridata[i][:fea_idx], oridata[i][fea_idx+1:]))  # drop the used feature column
    for i in idxg:
        datag.append(append(oridata[i][:fea_idx], oridata[i][fea_idx+1:]))
    labell = label[idxl]
    labelg = label[idxg]
    return datal, datag, labell, labelg


Here args holds the thresholds that decide the split at each node (one per feature: values greater than the threshold go to the ">" branch, values less than it go to the "<" branch). We can define it as follows:

args = mean(traindata, axis=0)



Test: splitting on feature 2 (the original post shows the resulting "less" and "greater" index sets in a screenshot).

In other words, splitting the sample set on args[2], the < and > branches contain 57 and 93 samples, respectively.
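As a quick sanity check (not in the original post), the counts above can be reproduced with the functions already defined, assuming traindata and args have been set up as shown:

#quick check: reproduce the 57 / 93 split reported above
idx_less, idx_greater = splitdata(traindata, 2)
print len(idx_less), len(idx_greater)  # expected: 57 93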




4. Choosing the Best Splitting Feature by Maximum Information Gain

The information gain is info_gain in the code below; the conditional-entropy formula it uses appears in the inline comment.
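Spelled out (this merely restates what the code computes, it is not an addition to the algorithm): for a candidate splitting feature X and class label C,

    gain(X) = H(C) - \sum_i p(x_i) H(C | X = x_i)

i.e. the base entropy of the labels minus the probability-weighted entropy after the split (cur_entropy in the code).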

#select the best feature to split on
def choosebest_splitnode(oridata, label):
    n_fea = len(oridata[0])
    n = len(label)
    base_entropy = calentropy(label)
    best_gain = -1
    for fea_i in range(n_fea):  # compute the conditional entropy for each candidate splitting feature
        cur_entropy = 0
        idxset_less, idxset_greater = splitdata(oridata, fea_i)
        prob_less = float(len(idxset_less)) / n
        prob_greater = float(len(idxset_greater)) / n

        #entropy(C|X) = \sum{p(xi)*entropy(C|X=xi)}
        cur_entropy += prob_less * calentropy(label[idxset_less])
        cur_entropy += prob_greater * calentropy(label[idxset_greater])

        info_gain = base_entropy - cur_entropy  # notice: gain is "before" minus "after"
        if info_gain > best_gain:
            best_gain = info_gain
            best_idx = fea_i
    return best_idx

#testcode:
#x = choosebest_splitnode(traindata, trainlabel)



Testing on the full dataset: which feature does the first split choose? (The original post shows the answer in a screenshot.)
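A short way to answer this yourself (not from the original post; it assumes the data and functions defined above) is to call choosebest_splitnode directly on the full training set:

#index and name of the feature chosen for the first split
best = choosebest_splitnode(traindata, trainlabel)
print best, feaname[best]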






5. Building the Decision Tree Recursively

See the code comments for details; buildtree builds the tree recursively.

The recursion stops when:

(1) the branch contains no samples (the subset is empty), or

(2) all samples in the split belong to the same class, or

(3) no features remain (each split consumes one feature); in this case, return the label of the majority of samples in the current subset.


#create the decision tree based on information gain
def buildtree(oridata, label):
    if label.size == 0:  # if no samples belong to this branch
        return "NULL"
    listlabel = label.tolist()
    #stop when all samples in this subset belong to one class
    if listlabel.count(label[0]) == label.size:
        return label[0]

    #return the majority label in this subset if no more features are available
    if len(feanamecopy) == 0:
        cnt = {}
        for cur_l in label:
            if cur_l not in cnt.keys():
                cnt[cur_l] = 0
            cnt[cur_l] += 1
        maxx = -1
        for keys in cnt:
            if maxx < cnt[keys]:
                maxx = cnt[keys]
                maxkey = keys
        return maxkey

    bestsplit_fea = choosebest_splitnode(oridata, label)  # get the best splitting feature
    print bestsplit_fea, len(oridata[0])
    cur_feaname = feanamecopy[bestsplit_fea]  # use the feature name as the dictionary key
    print cur_feaname
    nodedict = {cur_feaname: {}}
    del(feanamecopy[bestsplit_fea])  # delete the current feature from the working copy of feaname
    split_idx = splitdata(oridata, bestsplit_fea)  # indices for both the "less" and "greater" branches
    data_less, data_greater, label_less, label_greater = idx2data(oridata, label, split_idx, bestsplit_fea)

    #build the tree recursively; the left and right subtrees are the "<" and ">" branches, respectively
    nodedict[cur_feaname]["<"] = buildtree(data_less, label_less)
    nodedict[cur_feaname][">"] = buildtree(data_greater, label_greater)
    return nodedict

#testcode:
#feanamecopy = feaname[:]  # buildtree consumes features from this working copy of feaname
#mytree = buildtree(traindata, trainlabel)
#print mytree

Result:

(The original post shows the printed tree in a screenshot.) mytree is the result: a key such as "#1" names the feature used for the split at that node, and the '<' and '>' keys hold the subtrees for the "less" and "greater" data, respectively.
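For readers without the screenshot, the returned tree is a nested dict; its shape looks like the following sketch (the keys and leaf labels here are illustrative only, not the actual tree learned from Iris):

#illustrative shape of the returned tree (not the actual Iris result)
mytree = {"#2": {"<": "Iris-setosa",
                 ">": {"#3": {"<": "Iris-versicolor",
                              ">": "Iris-virginica"}}}}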





6. Classifying Samples

Classification uses the constructed mytree, recursively walking down the branches.

#classify a new sample
def classify(mytree, testdata):
    if type(mytree).__name__ != 'dict':  # reached a leaf: return the class label
        return mytree
    fea_name = mytree.keys()[0]  # the feature name stored at this node
    fea_idx = feaname.index(fea_name)  # the index of feature 'fea_name'
    val = testdata[fea_idx]
    nextbranch = mytree[fea_name]

    #compare the current value with the pivot (the feature mean)
    if val > args[fea_idx]:
        nextbranch = nextbranch[">"]
    else:
        nextbranch = nextbranch["<"]
    return classify(nextbranch, testdata)

#testcode
tt = traindata[0]
x = classify(mytree, tt)
print x

Result:

To verify the code, we change the args thresholds, setting them all to 0 (a very small value):

args = [0, 0, 0, 0]

Building the tree and classifying again gives the result shown in the original post's screenshot.

Since no feature value is less than the pivot (0), every "<" key in the dict maps to the empty branch ("NULL").
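As a further check (not in the original post), one can classify every training sample with the learned tree and measure the training accuracy; this sketch assumes mytree, traindata, trainlabel, and args are set up as in the earlier sections (with args restored to the feature means):

#optional: training-set accuracy of the learned tree
correct = 0
for i in range(len(traindata)):
    if classify(mytree, traindata[i]) == trainlabel[i]:
        correct += 1
print float(correct) / len(traindata)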




The complete code from this article is available for download: decision tree implementation in Python.

Reference: Machine Learning in Action



From: http://blog.csdn.net/abcjennifer/article/details/20905311
