Decision Tree and Its Implementation


This article implements a Decision Tree step by step in Python, covering the following stages:

  • Loading the dataset
  • Computing the entropy
  • Splitting the data on the best splitting feature
  • Choosing the best splitting feature by maximum information gain
  • Building the decision tree recursively
  • Classifying samples

This article barely touches decision-tree theory; for details, search for the keywords "decision tree, information gain, entropy".

Each step is reflected directly in the code.

All of the code in this article lives in a single .py file.



1. Loading the Dataset

We use the classic UCI Iris dataset as the example.

Brief of the Iris dataset:

  Data Set Characteristics: Multivariate
  Number of Instances: 150
  Area: Life
  Attribute Characteristics: Real
  Number of Attributes: 4
  Date Donated: 1988-07-01
  Associated Tasks: Classification
  Missing Values: No
  Number of Web Hits: 533125


Code:

from numpy import *
#load "iris.data" into the workspace
traindata = loadtxt(r"D:\ZJU_Projects\machine learning\ML_Action\Dataset\Iris.data", delimiter=',', usecols=(0,1,2,3), dtype=float)
trainlabel = loadtxt(r"D:\ZJU_Projects\machine learning\ML_Action\Dataset\Iris.data", delimiter=',', usecols=(4,), dtype=str)
feaname = ["#0", "#1", "#2", "#3"]  # feature names of the 4 attributes (features)

Result:

(The original post shows a screenshot of the loaded data here: the left image is the raw dataset, with four real-valued features per sample and one label column giving the class; the three classes are Iris-setosa, Iris-versicolor, and Iris-virginica.)
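If the local Iris.data file is not at hand, the same arrays can be obtained from scikit-learn's bundled copy of the dataset. This is only a convenience sketch, not part of the original post; it assumes scikit-learn is installed, and note that its class names ('setosa', ...) differ from the 'Iris-setosa' style strings in the raw UCI file.

from sklearn.datasets import load_iris

iris = load_iris()
traindata = iris.data                        # 150 x 4 array of real-valued features
trainlabel = iris.target_names[iris.target]  # string labels: 'setosa', 'versicolor', 'virginica'
feaname = ["#0", "#1", "#2", "#3"]           # same feature names as above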




2. Computing the Entropy

Entropy was introduced by Shannon (a giant of information theory); see Wikipedia for the definition.

Note that the entropy computed here is H(C|X=xi), not H(C|X); H(C|X) is computed in step 4, where the per-subset entropies are additionally weighted by their probabilities and summed.
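For reference, if the classes occur in the given label set with fractions p_1, ..., p_k, the quantity computed by calentropy below is

    H = -\sum_i p_i \log_2 p_i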

    Code:

from math import log

def calentropy(label):
    n = label.size  # the number of samples
    count = {}  # dictionary counting occurrences of each class label
    for curlabel in label:
        if curlabel not in count.keys():
            count[curlabel] = 0
        count[curlabel] += 1
    entropy = 0
    for key in count:
        pxi = float(count[key]) / n  # notice: convert to float first
        entropy -= pxi * log(pxi, 2)
    return entropy

#testcode:
#x = calentropy(trainlabel)


Result:

(The original post shows a screenshot of the computed value here; for the full Iris label set, with 50 samples in each of the three classes, the entropy is log2(3) ≈ 1.585.)

3. Splitting the Data on the Best Splitting Feature

Assume we have already found the best splitting feature (its index is splitfea_idx); splitdata performs the split on it.

The second function, idx2data, takes the two index sets produced by splitdata and returns datal (samples less than the pivot), datag (samples greater than the pivot), labell, and labelg. We use the mean of the selected feature as the pivot.

#split the dataset according to feature "splitfea_idx"
def splitdata(oridata, splitfea_idx):
    arg = args[splitfea_idx]  # the pivot: the mean of this feature over all samples
    idx_less = []  # indices of samples whose feature value is less than the pivot
    idx_greater = []  # indices of samples whose feature value is greater than (or equal to) the pivot
    n = len(oridata)
    for idx in range(n):
        d = oridata[idx]
        if d[splitfea_idx] < arg:
            #add the entry to the "less" set
            idx_less.append(idx)
        else:
            idx_greater.append(idx)
    return idx_less, idx_greater

#testcode:
#idx_less, idx_greater = splitdata(traindata, 2)


#give the data and labels according to the index sets
def idx2data(oridata, label, splitidx, fea_idx):
    idxl = splitidx[0]  # indices of the "less" split
    idxg = splitidx[1]  # indices of the "greater" split
    datal = []
    datag = []
    labell = []
    labelg = []
    for i in idxl:
        datal.append(append(oridata[i][:fea_idx], oridata[i][fea_idx+1:]))  # drop the used feature column
    for i in idxg:
        datag.append(append(oridata[i][:fea_idx], oridata[i][fea_idx+1:]))
    labell = label[idxl]
    labelg = label[idxg]
    return datal, datag, labell, labelg


Here args holds the thresholds that decide the split at each node (one per feature: values greater than the threshold go to the ">" branch, values less than it go to the "<" branch). We can define it as follows:

args = mean(traindata, axis=0)



Test: splitting on feature 2 (the original post shows the resulting "less" and "greater" index sets in a screenshot).

In other words, splitting the sample set on args[2], the < and > branches contain 57 and 93 samples, respectively.
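As a quick sanity check (not in the original post), the counts above can be reproduced with the functions already defined, assuming traindata and args have been set up as shown:

#quick check: reproduce the 57 / 93 split reported above
idx_less, idx_greater = splitdata(traindata, 2)
print len(idx_less), len(idx_greater)  # expected: 57 93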




4. Choosing the Best Splitting Feature by Maximum Information Gain

The information gain is info_gain in the code below; the conditional-entropy formula it uses appears in the inline comment.
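Spelled out (this merely restates what the code computes, it is not an addition to the algorithm): for a candidate splitting feature X and class label C,

    gain(X) = H(C) - \sum_i p(x_i) H(C | X = x_i)

i.e. the base entropy of the labels minus the probability-weighted entropy after the split (cur_entropy in the code).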

#select the best feature to split on
def choosebest_splitnode(oridata, label):
    n_fea = len(oridata[0])
    n = len(label)
    base_entropy = calentropy(label)
    best_gain = -1
    for fea_i in range(n_fea):  # compute the conditional entropy for each candidate splitting feature
        cur_entropy = 0
        idxset_less, idxset_greater = splitdata(oridata, fea_i)
        prob_less = float(len(idxset_less)) / n
        prob_greater = float(len(idxset_greater)) / n

        #entropy(C|X) = \sum{p(xi)*entropy(C|X=xi)}
        cur_entropy += prob_less * calentropy(label[idxset_less])
        cur_entropy += prob_greater * calentropy(label[idxset_greater])

        info_gain = base_entropy - cur_entropy  # notice: gain is "before" minus "after"
        if info_gain > best_gain:
            best_gain = info_gain
            best_idx = fea_i
    return best_idx

#testcode:
#x = choosebest_splitnode(traindata, trainlabel)



Testing on the full dataset: which feature does the first split choose? (The original post shows the answer in a screenshot.)
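A short way to answer this yourself (not from the original post; it assumes the data and functions defined above) is to call choosebest_splitnode directly on the full training set:

#index and name of the feature chosen for the first split
best = choosebest_splitnode(traindata, trainlabel)
print best, feaname[best]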






5. Building the Decision Tree Recursively

See the code comments for details; buildtree builds the tree recursively.

The recursion stops when:

(1) the branch contains no samples (the subset is empty), or

(2) all samples in the split belong to the same class, or

(3) no features remain (each split consumes one feature); in this case, return the label of the majority of samples in the current subset.


#create the decision tree based on information gain
def buildtree(oridata, label):
    if label.size == 0:  # if no samples belong to this branch
        return "NULL"
    listlabel = label.tolist()
    #stop when all samples in this subset belong to one class
    if listlabel.count(label[0]) == label.size:
        return label[0]

    #return the majority label in this subset if no more features are available
    if len(feanamecopy) == 0:
        cnt = {}
        for cur_l in label:
            if cur_l not in cnt.keys():
                cnt[cur_l] = 0
            cnt[cur_l] += 1
        maxx = -1
        for keys in cnt:
            if maxx < cnt[keys]:
                maxx = cnt[keys]
                maxkey = keys
        return maxkey

    bestsplit_fea = choosebest_splitnode(oridata, label)  # get the best splitting feature
    print bestsplit_fea, len(oridata[0])
    cur_feaname = feanamecopy[bestsplit_fea]  # use the feature name as the dictionary key
    print cur_feaname
    nodedict = {cur_feaname: {}}
    del(feanamecopy[bestsplit_fea])  # delete the current feature from the working copy of feaname
    split_idx = splitdata(oridata, bestsplit_fea)  # indices for both the "less" and "greater" branches
    data_less, data_greater, label_less, label_greater = idx2data(oridata, label, split_idx, bestsplit_fea)

    #build the tree recursively; the left and right subtrees are the "<" and ">" branches, respectively
    nodedict[cur_feaname]["<"] = buildtree(data_less, label_less)
    nodedict[cur_feaname][">"] = buildtree(data_greater, label_greater)
    return nodedict

#testcode:
#feanamecopy = feaname[:]  # buildtree consumes features from this working copy of feaname
#mytree = buildtree(traindata, trainlabel)
#print mytree

Result:

(The original post shows the printed tree in a screenshot.) mytree is the result: a key such as "#1" names the feature used for the split at that node, and the '<' and '>' keys hold the subtrees for the "less" and "greater" data, respectively.
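For readers without the screenshot, the returned tree is a nested dict; its shape looks like the following sketch (the keys and leaf labels here are illustrative only, not the actual tree learned from Iris):

#illustrative shape of the returned tree (not the actual Iris result)
mytree = {"#2": {"<": "Iris-setosa",
                 ">": {"#3": {"<": "Iris-versicolor",
                              ">": "Iris-virginica"}}}}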





6. Classifying Samples

Classification uses the constructed mytree, recursively walking down the branches.

#classify a new sample
def classify(mytree, testdata):
    if type(mytree).__name__ != 'dict':  # reached a leaf: return the class label
        return mytree
    fea_name = mytree.keys()[0]  # the feature name stored at this node
    fea_idx = feaname.index(fea_name)  # the index of feature 'fea_name'
    val = testdata[fea_idx]
    nextbranch = mytree[fea_name]

    #compare the current value with the pivot (the feature mean)
    if val > args[fea_idx]:
        nextbranch = nextbranch[">"]
    else:
        nextbranch = nextbranch["<"]
    return classify(nextbranch, testdata)

#testcode
tt = traindata[0]
x = classify(mytree, tt)
print x

Result:

To verify the code, we change the args thresholds, setting them all to 0 (a very small value):

args = [0, 0, 0, 0]

Building the tree and classifying again gives the result shown in the original post's screenshot.

Since no feature value is less than the pivot (0), every "<" key in the dict maps to the empty branch ("NULL").
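As a further check (not in the original post), one can classify every training sample with the learned tree and measure the training accuracy; this sketch assumes mytree, traindata, trainlabel, and args are set up as in the earlier sections (with args restored to the feature means):

#optional: training-set accuracy of the learned tree
correct = 0
for i in range(len(traindata)):
    if classify(mytree, traindata[i]) == trainlabel[i]:
        correct += 1
print float(correct) / len(traindata)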




The complete code from this article is available for download: decision tree implementation in Python.

Reference: Machine Learning in Action



From: http://blog.csdn.net/abcjennifer/article/details/20905311
