Seeing the Forest Through the Trees
How a Decision Tree Works
Pictorially, a decision tree is like a flowchart in which each internal (parent) node represents a test on an attribute, and each leaf node represents the final category assigned to the data points that reach that leaf.
Figure 1: Student sample distribution

In the illustration above, a total of 13 students were randomly sampled from a student performance dataset. The scatter plot shows the distribution of the sample based on two attributes: raisedhands and visitedResources.
Our intent is to manually construct a decision tree that best separates the sample data points into the distinct classes L, M, and H, where:
L = Low performance category
M = Medium (average) performance category
H = High performance category
Option A

One option is to split the data along the attribute visitedResources at the 70 mark.
This “perfectly” separates the H class from the rest.
Option B

Another option is to split along the same attribute, visitedResources, at the 41 mark.
No “perfect” separation is achieved for any class.
Option C

Another option is to split along the attribute raisedhands at the 38 mark.
This “perfectly” separates the L class from the rest.
Options A and C do a better job of cleanly separating at least one of the classes. Suppose we pick option A; the resulting decision tree will be:
The left branch contains only H-class students and therefore cannot be split any further. On the right branch, the resulting node has four students each from the M and L classes.
Remember that this is the current state of our separation exercise.
How best can the remaining students (data points) be separated into their appropriate classes? Yes, you guessed right: draw more lines!
One option is to split along the attribute raisedhands at the 38 mark.
Again, any number of split lines could be drawn; however, this option yields a good result, so we shall go with it.
The resultant decision tree after the split is shown below:
Clearly, the data points are perfectly separated into the appropriate classes, hence no further logical separation is needed.
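To make the final tree concrete, the two chosen splits can be written as a small rule set. This is only an illustrative sketch; it assumes, as the plots suggest, that the H students lie on the higher side of the visitedResources split and the M students lie on the higher side of the raisedhands split.

```python
def classify_student(visited_resources: int, raised_hands: int) -> str:
    """Hand-built tree from the two splits chosen above (split directions assumed)."""
    if visited_resources > 70:   # first split: visitedResources at 70
        return "H"               # this branch contains only H students
    elif raised_hands > 38:      # second split: raisedhands at 38
        return "M"
    else:
        return "L"
```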
Lessons Learnt So Far:
In ML parlance, this process of building out a decision tree that best classifies a given dataset is referred to as learning.
In manually constructing the decision tree, we learnt that separation lines can be drawn at any point along any of the attributes available in a dataset. The question is: at any given decision node, which of the possible attributes and separation points will do a better job of separating the dataset into the desired or near-desired classes or categories? An instrument for determining the answer to this question is the Gini impurity.
Gini Impurity
Suppose we have a new student and we randomly classify this new student into any of the three classes according to the probability distribution of the classes. The Gini impurity is a measure of the likelihood of incorrectly classifying that new random student (variable). It is a probabilistic measure, hence it is bounded between 0 and 1.
We have a total of 13 students in our sample dataset, and the probabilities of the H, M, and L classes are 5/13, 4/13, and 4/13 respectively.
The formula below is applied in calculating the Gini impurity:
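For a node whose data points fall into $K$ classes with probabilities $p_1, \dots, p_K$, the standard form of the Gini impurity is:

$$G = \sum_{k=1}^{K} p_k \, (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$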
The above formula, when applied to our example case, becomes:
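With our three classes and their probabilities $p_H$, $p_M$ and $p_L$, this is:

$$G = 1 - \left( p_H^2 + p_M^2 + p_L^2 \right)$$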
Therefore, the Gini impurity at the root node of the decision tree, before any split, is computed as:
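Substituting the class probabilities 5/13, 4/13 and 4/13 from our 13-student sample gives:

$$G_{\text{root}} = 1 - \left[ \left(\tfrac{5}{13}\right)^2 + \left(\tfrac{4}{13}\right)^2 + \left(\tfrac{4}{13}\right)^2 \right] \approx 0.66$$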
Recall the split options A and C discussed earlier at the root node. Let us compare the Gini impurities of the two options and see why A was picked as the better split choice.
(Figures: Gini impurity computations for split option A and split option C.)

Therefore, the amount of impurity removed with split option A, i.e. its Gini gain, is 0.66 - 0.3 = 0.36, while that for split option C is 0.66 - 0.37 = 0.29.
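As a sketch of where the value for option A comes from, using the usual weighted average of the child-node impurities: the left child holds the 5 H students (impurity 0) and the right child holds 4 M and 4 L students, so

$$G_{A} = \tfrac{5}{13} \cdot 0 + \tfrac{8}{13} \left[ 1 - \left(\tfrac{4}{8}\right)^2 - \left(\tfrac{4}{8}\right)^2 \right] = \tfrac{8}{13} \cdot 0.5 \approx 0.31,$$

which, rounded to one decimal place, is the 0.3 used above and yields the Gini gain of 0.66 - 0.3 = 0.36.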
Obviously, a Gini gain of 0.36 > 0.29 means option A is the better split choice, confirming the earlier decision to pick A over C.
The Gini impurity at a node where all the students belong to only one class, say H, is always equal to zero, meaning no impurity. This implies a perfect classification, hence no further split is needed.
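For instance, a node containing only H students has $p_H = 1$ and $p_M = p_L = 0$, so $G = 1 - 1^2 = 0$.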
Random Forest
We have seen that many decision trees can be generated from the same dataset, and that the performance of the trees at correctly predicting unseen examples can vary. Also, using a single tree model (decision tree) can easily lead to over-fitting.
The question becomes: how do we make sure to construct the best-performing tree possible? An answer to this is to smartly construct as many trees as possible and use averaging to improve the predictive accuracy and control over-fitting. This method is called the Random Forest. It is random because each tree is constructed not with the entire training dataset but with a random sample of the data points and attributes.
We shall use the random forest implementation in the Scikit-learn Python package to demonstrate how a random forest model can be trained and tested, and how to visualize one of the trees that constitute the forest.
For this exercise, we shall train a random forest model to predict (classify) the academic performance category (Class) which students belong to, based on their participation in class/learning processes.
In the dataset for this exercise, students’ participation is defined as a measure of four variables, which are:
Raised hands: How many times the student raised his/her hand in class to ask or answer questions (numeric: 0–100)
Visited resources: How many times the student visited a course content (numeric: 0–100)
Viewing announcements: How many times the student checked the news announcements (numeric: 0–100)
Discussion groups: How many times the student participated in discussion groups (numeric: 0–100)
In the sample extract below, the first four (4) numeric columns correspond to the students' participation measures defined earlier, and the last, categorical column, Class, represents the student's performance. A student can be in any of three (3) classes: Low, Medium, or High performance.
Figure-1: Dataset extract: students' participation measures and performance class

Basic data preparation steps:
An implementation of these basic data preparation steps is shown in the snippet below:
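A minimal sketch of such preparation is shown here; it assumes the data sits in a CSV file (the file name students.csv and the column names below are assumptions to adjust for the actual dataset), integer-encodes the target, and splits off a test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load the student performance data (file name and column names are assumed).
data = pd.read_csv("students.csv")

feature_cols = ["raisedhands", "visitedResources", "announcementsView", "discussion"]
X = data[feature_cols]

# Encode the categorical target (L, M, H) as integers.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data["Class"])

# Hold out a portion of the sample for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```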
Next, we shall create a RandomForest instance and fit the model (build the trees) to the train set.
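A sketch along these lines, with illustrative parameter values (not necessarily those used to produce the results reported below), would be:

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,   # number of trees in the forest
    criterion="gini",   # use the Gini impurity to pick splits
    max_depth=4,        # cap on the depth of each tree
    random_state=42,
)
model.fit(X_train, y_train)
```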
Where:
n_estimators = the number of trees that make up the forest
criterion = the method used to pick the best attribute split for the decision trees. Here, the Gini impurity is used.
max_depth = a cap on the depth of the trees. If no clear classification has been reached at this depth, the model treats all the nodes at that level as leaf nodes, and the data points in each leaf are assigned to the majority class of that node.
Note that the optimal n_estimators and max_depth combination can only be determined by experimenting with several combinations. One way to achieve this is by using the grid search method.
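For example, scikit-learn's GridSearchCV can evaluate a grid of candidate combinations with cross-validation; the grid below is purely illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values to try; the grid itself is an assumption for illustration.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 4, 6, None],
}

search = GridSearchCV(
    RandomForestClassifier(criterion="gini", random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_)
```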
Model Evaluation
While several metrics exist for evaluating models, we shall use one of the most basic: accuracy.
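A sketch of computing this with the fitted model, using the classifier's score method (which reports mean accuracy):

```python
# Mean accuracy of the fitted forest on the train and test sets.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Accuracy on train set: {train_accuracy:.2%}, test set: {test_accuracy:.2%}")
```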
Accuracy on train set: 72.59%; test set: 68.55%. This could be better, but it is not a bad benchmark.
Visualizing the Best Tree in the Forest
The best tree in a random forest model can easily be visualized, enabling engineers, scientists, and business specialists alike to gain some understanding of the decision flow of the model.
The snippet below extracts and visualizes the best tree from the model trained above:
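A sketch of one way to do this, assuming we pick the tree that scores best on the test set and plot it with scikit-learn's plot_tree:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Score every tree in the forest on the test set and keep the best-scoring one.
# (The individual trees were fitted on the integer-encoded labels, so scoring
# them against y_test works here.)
best_tree = max(model.estimators_, key=lambda tree: tree.score(X_test, y_test))

plt.figure(figsize=(20, 10))
plot_tree(
    best_tree,
    feature_names=feature_cols,
    class_names=list(label_encoder.classes_),
    filled=True,
)
plt.show()
```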
Decision tree extracted from the random forest.

Conclusion:
In this article, we looked at how a decision tree works, understood how attribute split choices are made using the Gini impurity, saw how several decision trees are ensembled to make a random forest, and finally demonstrated the random forest algorithm by training a model to classify students into academic performance categories based on their participation in class/learning processes.
Thanks for reading.
Translated from: https://towardsdatascience.com/seeing-the-forest-through-the-trees-45deafe1a6f0