History of pruning algorithm development and Python implementation (finished)
Python implementations for all 7 post-pruning algorithms are collected here.
Table of Decision Trees:
| Algorithm | Author | Source | Year |
| --- | --- | --- | --- |
| ID3 | Ross Quinlan | 《Discovering rules by induction from large collections of examples》 | 1979 |
| ID3 | Ross Quinlan | Another origin: 《Learning efficient classification procedures and their application to chess end games》 | 1983a |
| CART | Leo Breiman | 《Classification and Regression Trees》 | 1984 |
| C4.5 | Ross Quinlan | 《C4.5: Programs for Machine Learning》 | 1993 |
| C5.0 | Ross Quinlan | Commercial edition of C4.5; no relevant papers | - |
Table of Pruning Algorithms:
| Algorithm | Source | Year | Author | System | Notes |
| --- | --- | --- | --- | --- | --- |
| Pessimistic Pruning | 《Simplifying Decision Trees》, part 2.3 | 1986b (some sources say 1987b; the date printed on the paper is used here) | Quinlan | C4.5 | Ross Quinlan invented "Pessimistic Pruning"; John Mingers renamed it "Pessimistic Error Pruning" in 《An Empirical Comparison of Pruning Methods for Decision Tree Induction》 |
| Reduced Error Pruning | 《Simplifying Decision Trees》, part 2.2 | 1986b | Quinlan | C4.5 | Requires an extra validation set for pruning |
| Cost-Complexity Pruning | 《Classification and Regression Trees》, Section 3.3 | 1984 | Leo Breiman | CART | For classification trees only |
| Error-Complexity Pruning | 《Classification and Regression Trees》, Section 8.5.1 | 1984 | Leo Breiman | CART | For regression trees only |
| Critical Value Pruning | 《Expert Systems-Rule Induction with Statistical Data》; another claimed source is 《An Empirical Comparison of Pruning Methods for Decision Tree Induction》, but that paper's author says he is citing a 1987 paper | 1987a | John Mingers | Not stated in the paper | Accounts of CVP's origin differ; the source given here follows what is mentioned on p. 212 of 《An Empirical Comparison of Pruning Methods for Ensemble Classifiers》 |
| Minimum-Error Pruning | 《Learning decision rules in noisy domains》 | 1986 | Niblett and Bratko | - | Cannot be downloaded from the Internet |
| Minimum-Error Pruning | 《On estimating probabilities in tree pruning》 | 1991 | Bojan Cestnik, Ivan Bratko | - | Modified MEP algorithm |
| Error-Based Pruning | 《C4.5: Programs for Machine Learning》, Section 4.2 | 1993 | Quinlan | C4.5 | EBP is an evolution of PEP |
Pruning targets for classification trees (ID3, C4.5, CART-classification):
1. Simplify the tree without losing too much accuracy.
2. Improve generalization ability on validation sets.
3. Alleviate overfitting.
Pruning targets for regression trees (CART-regression):
1. Simplify the tree without increasing MSE too much.
2. Improve generalization ability on validation sets.
3. Alleviate overfitting.
----------------For C4.5-----------------
U_25%(1,16) and U_25%(1,168) in 《C4.5: Programs for Machine Learning》 (see the sketch after this list)
Differences between the C4.5 in Weka 3.6.10 and Professor Quinlan's C4.5 algorithm
Reproducing the experimental results of Chapter 4 of 《C4.5: Programs for Machine Learning》
Ross Quinlan's handling of missing values in C4.5 Release 8
A detailed interpretation of the relationship between the latest C4.5 release (Release 8) and MDL
Some notes on 《Inferring Decision Trees Using the Minimum Description Length Principle》
Some notes on 《Improved Use of Continuous Attributes in C4.5》
Code architecture diagram of C4.5 Release 8
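The U_25%(1,16) and U_25%(1,168) values referenced in the first post above are upper confidence limits of the binomial distribution, which C4.5's error-based pruning looks up for E errors in N cases. A minimal sketch of computing U_CF(E, N) with scipy (the function name `ucf` is ours, not Quinlan's):

```python
from scipy.stats import beta

def ucf(e, n, cf=0.25):
    """Upper limit of the binomial confidence interval that C4.5 uses:
    the error rate p at which observing <= e errors in n cases has
    probability cf (i.e. the Clopper-Pearson upper bound)."""
    if e >= n:
        return 1.0
    return beta.ppf(1.0 - cf, e + 1, n - e)

print(ucf(0, 6))    # ~0.206, the U_25%(0,6) value quoted in Quinlan's book
print(ucf(1, 16))   # U_25%(1,16)
print(ucf(1, 168))  # U_25%(1,168)
```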
------------REP-start--------
Detailed explanation of REP (Reduced Error Pruning) code for ID3 + drawing Figures 4.5, 4.6, and 4.7 of the decision-tree chapter in Zhou Zhihua's 《機器學習》 (Machine Learning)
Drawing Figures 4.4 and 4.9 of Zhou Zhihua's 《機器學習》 (reprint, with an added entropy display)
Handling continuous values in ID3 decision trees + drawing Figures 4.8 and 4.10 of Zhou Zhihua's 《機器學習》
sklearn does not implement the ID3 algorithm
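As a companion to the posts above, here is a minimal sketch of the REP rule itself (the `Node` layout and function names are ours): prune bottom-up, replacing a subtree by its majority-class leaf whenever that does not increase the error on the validation cases reaching the node.

```python
class Node:
    def __init__(self, attr=None, children=None, label=None):
        self.attr = attr                # attribute tested at this node
        self.children = children or {}  # attribute value -> child Node
        self.label = label              # majority class of training cases here

def predict(node, x):
    """Classify example x (a dict: attribute -> value)."""
    if not node.children:
        return node.label
    child = node.children.get(x[node.attr])
    return predict(child, x) if child else node.label

def rep(node, val):
    """Reduced Error Pruning. `val` holds the validation cases (x, y)
    that reach this node; prune the children first (bottom-up)."""
    if not node.children:
        return node
    for v, child in node.children.items():
        node.children[v] = rep(child, [(x, y) for x, y in val if x[node.attr] == v])
    subtree_err = sum(predict(node, x) != y for x, y in val)
    leaf_err = sum(node.label != y for _, y in val)
    if leaf_err <= subtree_err:
        node.children = {}  # replace the subtree with a majority-class leaf
    return node
```

Routing only the validation cases that reach each node keeps the comparison local: pruning a node cannot change how ancestors route the remaining cases.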
------------REP-end--------
----------------PEP-start-----------------
Earliest PEP Algorithm Principles
Pessimistic Error Pruning example of C4.5
Pessimistic Error Pruning illustration with C4.5 - Python implementation
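A minimal sketch of the PEP test at a single internal node, using the continuity correction of 1/2 per leaf from 《Simplifying Decision Trees》 (the function name and example numbers are ours):

```python
import math

def pep_should_prune(node_errors, subtree_errors, n_leaves, n_cases):
    """Pessimistic Error Pruning test at an internal node.
    node_errors    : training errors if the node became a leaf
    subtree_errors : sum of training errors over the subtree's leaves
    n_leaves       : number of leaves in the subtree
    n_cases        : training cases reaching the node
    PEP itself traverses the tree top-down, applying this test at each node."""
    e_leaf = node_errors + 0.5                 # corrected error as a leaf
    e_tree = subtree_errors + 0.5 * n_leaves   # corrected error of the subtree
    se = math.sqrt(e_tree * (n_cases - e_tree) / n_cases)
    return e_leaf <= e_tree + se               # prune if within one SE

# e.g. a node with 100 cases: 10 errors as a leaf; subtree: 5 leaves, 6 errors
print(pep_should_prune(10, 6, 5, 100))
```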
----------------PEP-end----------------
-------------EBP-start--------------------
Error-Based Pruning: algorithm, code implementation, and worked examples
Some may ask why we do not simply use the Python interface to Weka's J48.
Note that Weka follows Quinlan's C implementation (put bluntly, J48 is a port of C4.5 Release 8). On some datasets, e.g. the hypo dataset, Weka generates a very large decision tree, which is bad:
since the original purpose of a decision tree is to aid classification and produce readable knowledge,
a huge tree is clearly impractical to use.
J48: the Java edition of C4.5 Release 8
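A minimal sketch of the EBP decision, reusing the `ucf` helper from the C4.5 section above; the worked numbers follow the congressional-votes example in Chapter 4 of the book as recalled here, and branch grafting (the other half of EBP) is omitted:

```python
def predicted_errors(e, n, cf=0.25):
    """EBP's pessimistic error count for a leaf with n cases and e errors."""
    return n * ucf(e, n, cf)

def ebp_should_prune(node_e, node_n, leaf_stats, cf=0.25):
    """Replace the subtree with a leaf if the leaf's predicted errors do
    not exceed the sum of predicted errors over the subtree's leaves.
    leaf_stats: list of (errors, cases) pairs for the subtree's leaves."""
    as_leaf = predicted_errors(node_e, node_n, cf)
    as_subtree = sum(predicted_errors(e, n, cf) for e, n in leaf_stats)
    return as_leaf <= as_subtree

# subtree leaves cover 6, 9, and 1 cases with no errors; collapsing the
# subtree into one leaf gives 16 cases with 1 error
print(ebp_should_prune(1, 16, [(0, 6), (0, 9), (0, 1)]))
```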
-------------EBP-end--------------------
-------------MEP-start--------------------
Two Examples of Minimum Error Pruning (reprint)
MEP (Minimum Error Pruning) principle with Python implementation
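A minimal sketch of the 1991 m-estimate version of MEP described in 《On estimating probabilities in tree pruning》 (function names and the worked numbers are ours):

```python
def m_static_error(class_counts, priors, m):
    """Expected classification error at a node, using the m-probability
    estimate p(c) = (n_c + p_a(c) * m) / (n + m) of Cestnik and Bratko.
    class_counts: dict class -> training cases of that class at the node
    priors      : dict class -> prior probability p_a(c)"""
    n = sum(class_counts.values())
    best = max((class_counts.get(c, 0) + priors[c] * m) / (n + m) for c in priors)
    return 1.0 - best

def mep_should_prune(node_counts, child_counts, priors, m):
    """Prune when the node's static error does not exceed the backed-up
    error: the children's errors weighted by their share of the cases."""
    n = sum(node_counts.values())
    static = m_static_error(node_counts, priors, m)
    backed_up = sum(sum(cc.values()) / n * m_static_error(cc, priors, m)
                    for cc in child_counts)
    return static <= backed_up

# two classes with uniform priors; m is the tunable noise parameter
priors = {"+": 0.5, "-": 0.5}
print(mep_should_prune({"+": 8, "-": 4},
                       [{"+": 6, "-": 1}, {"+": 2, "-": 3}],
                       priors, m=2.0))
```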
-----------------MEP-end--------------------
-----------------CVP-start--------------------
Proof of the chi-square statistic used in CVP
CVP (Critical Value Pruning) illustration, with the principle explained in detail
CVP (Critical Value Pruning) examples with Python implementation
An error in a paper about CVP
Contingency table computation in Python and analysis of experimental results
Chi-square distribution computation in Python
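A minimal sketch of the contingency-table chi-square computation that CVP thresholds against a critical value, using scipy (the table values are made up):

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# rows: branches of the split; columns: classes
table = np.array([[30, 10],
                  [ 5, 25]])

stat, p_value, dof, expected = chi2_contingency(table, correction=False)
critical = chi2.ppf(0.95, dof)  # critical value at the 5% level

print(stat, p_value, dof)
print("keep the split" if stat > critical else "prune: split not significant")
```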
-----------------CVP-end--------------------
------------------------For CART-------------------------------------------------------
Notes from 《Classification and Regression Trees》
--------CCP-start---------------------
Drawing the decision tree on p. 59 of 《統計學習方法》 (Statistical Learning Methods) with sklearn
CCP (Cost Complexity Pruning) on sklearn with Python implementation
A theoretical defect in selecting the best pruned tree from CCP with cross-validation
Details of the 1-SE rule in CCP pruning of CART
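A minimal sketch of CCP with sklearn's built-in cost-complexity pruning path, which produces Breiman's nested subtree sequence (the iris dataset is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# alphas of the nested subtree sequence T1 > T2 > ... from Breiman's pruning
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={clf.get_n_leaves()}  "
          f"test acc={clf.score(X_te, y_te):.3f}")
```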
--------CCP-end---------------------
--------ECP-start---------------------
Explaining the difference between model trees and regression trees with examples
Error Complexity Pruning for sklearn's regression tree with Python implementation
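Since sklearn's DecisionTreeRegressor exposes the same ccp_alpha machinery with squared error in place of misclassification cost, an ECP-style sketch looks the same (the synthetic data is ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

# error-complexity sequence: larger alpha penalizes leaves more heavily
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas[::5]:  # sample every 5th alpha for brevity
    reg = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.4f}  leaves={reg.get_n_leaves()}")
```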
--------ECP-end---------------------
Note:
Critical Value Pruning can be used in both pre-pruning and post-pruning.
In pre-pruning, the IM (Information Measure) is replaced with the chi-square statistic.
When post-pruning with the $\chi^2$ test, you need an independent dataset.
Of course, if you grow a tree with CVP, there is no point in post-pruning it with CVP on the same data that was used to grow the tree.
Table about whether you need extra validation sets when pruning:
In the "Reference and Quotation" column of the following table, the word "test" means a set with class labels, so it actually means a "validation dataset".
| Algorithm | Extra validation set needed? | Reference and Quotation |
| --- | --- | --- |
| REP (Reduced Error Pruning) | Yes | 《An Empirical Comparison of Pruning Methods for Decision Tree Induction》, Section 2.2.4: "The pruned node will often make fewer errors on the test data than the sub-tree makes." |
| CCP (Cost Complexity Pruning) | Pruning stage: no. Selecting the best pruned tree: ① (small datasets) cross-validation: yes; ② (large datasets) 1-SE rule: no | 《Simplifying Decision Trees》, part 2.1: "Finally, the procedure requires a test set distinct from the original training set" |
| ECP (Error Complexity Pruning) | Pruning stage: no. Selecting the best pruned tree: ① (small datasets) cross-validation: yes; ② (large datasets) 1-SE rule: no | 《An Empirical Comparison of Pruning Methods for Decision Tree Induction》, part 2.2.1, page 230: "Instead, each of the pruned trees is used to classify an independent test data set" |
| CVP (Critical Value Pruning) | Pre-pruning: no; post-pruning: yes | 《Expert Systems-Rule Induction with Statistical Data》 (pre-pruning): "Initial runs were performed using chi-square purely as a means of choosing attributes - not of judging their significance - on the two years separately and on the data as a whole." |
| MEP (Minimum Error Pruning) | Depends on how you set m | 《On Estimating Probabilities in Tree Pruning》, Section 1: "m can be adjusted to some essential properties of the learning domain, such as the level of noise in the learning data." Section 2: "m can be set so as to maximise the classification accuracy on an independent data set" |
| PEP (Pessimistic Error Pruning) | No | 《Simplifying Decision Trees》, Section 2.3: "Unlike these methods, it does not require a test set separate from the cases in the training set from which the tree was constructed." |
| EBP (Error Based Pruning) | No | 《C4.5: Programs for Machine Learning》, page 40: "The approach taken in C4.5 belongs to the second family of techniques that use only the training set from which the tree was built." |
Attention:
When your dataset is small, you need cross-validation in CCP/ECP to produce K candidate models; each candidate model is then pruned repeatedly to get its pruned-tree sequence (K such sequences in total), and the best pruned tree is finally chosen from among the K sequences.
When your dataset is large, you use the 1-SE rule in CCP/ECP to select the best pruned tree.
In the GitHub link above, we use method ②.
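For illustration, a minimal sketch of the 1-SE selection rule; Breiman's formulation applies it on top of per-alpha error estimates, combined here with cross-validated scores (variable names are ours): take the largest alpha, i.e. the simplest tree, whose mean score is within one standard error of the best mean.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

means, ses = [], []
for a in alphas:
    scores = cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5)
    means.append(scores.mean())
    ses.append(scores.std(ddof=1) / np.sqrt(len(scores)))  # standard error of the mean

best = int(np.argmax(means))
threshold = means[best] - ses[best]
# largest alpha (simplest tree) whose mean score is still within 1 SE of the best
chosen = max(i for i, m in enumerate(means) if m >= threshold)
print(f"best alpha={alphas[best]:.4f}, 1-SE alpha={alphas[chosen]:.4f}")
```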
Pruning Style Table of Algorithms:
| Algorithm | Pruning style | Reference |
| --- | --- | --- |
| REP (Reduced Error Pruning) | Bottom-up | Not mentioned in the earliest article about it, 《Simplifying Decision Trees》 |
| CCP (Cost Complexity Pruning) | Bottom-up | 《Classification and Regression Trees》, Leo Breiman, Chapter 3.3 Minimal Cost-Complexity Pruning: "Thus, the pruning process produces a finite sequence of subtrees T1, T2, T3, ... with progressively fewer terminal nodes" |
| ECP (Error Complexity Pruning) | Bottom-up | 《Classification and Regression Trees》, Leo Breiman, Chapter 8.5.1: "Now minimal error-complexity pruning is done exactly as minimal cost-complexity pruning in classification." |
| CVP (Critical Value Pruning) | Top-down in pre-pruning; bottom-up in post-pruning | Pre-pruning, 《Expert Systems-Rule Induction with Statistical Data》: "The ID3 algorithm, with the enhancements mentioned previously, was modified to calculate $\chi^2$ instead of IM". Post-pruning: https://www.cs.rit.edu/~rlaz/prec2010/slides/DecisionTrees.pdf |
| MEP (Minimum Error Pruning) | Bottom-up | 《The effects of pruning methods on the predictive accuracy of induced decision rules》: "Niblett and Bratko [26] proposed a bottom-up approach for searching a single tree that minimizes the expected error rate." |
| PEP (Pessimistic Error Pruning) | Top-down | 《Top-Down Induction of Decision Trees Classifiers - A Survey》, part E: "The pessimistic pruning procedure performs top-down traversing over the internal nodes." |
| EBP (Error Based Pruning) | Bottom-up | 《C4.5: Programs for Machine Learning》, page 39: "Start from the bottom of the tree and examine each nonleaf sub-tree." |
Do the above pruning methods tend to over-prune or under-prune?
According to the following two articles:
《Simplifying Decision Trees by Pruning and Grafting: New Results》
《Top-Down Induction of Decision Trees Classifiers - A Survey》
the results are as follows:
| Algorithm | Tendency |
| --- | --- |
| REP | over-pruning, or not significant |
| PEP | under-pruning |
| EBP | under-pruning |
| MEP | under-pruning |
| CVP | under-pruning |
| CCP | under-pruning (from my own experiment) |
| ECP | under-pruning (from my own experiment) |
Markdown table generation tool: https://tool.lu/tables