
How to Evaluate the Performance of Your Machine Learning Model

Table of contents:

  • Why is evaluation necessary?
  • Confusion Matrix
  • Accuracy
  • Precision & Recall
  • ROC-AUC
  • Log Loss
  • Coefficient of Determination (R-Squared)
  • Summary

Why is evaluation necessary?

Let me start with a very simple example.

Robin and Sam both started preparing for an entrance exam for engineering college. They shared a room and put in an equal amount of hard work while solving numerical problems. They both studied almost the same hours for the entire year and appeared for the final exam. Surprisingly, Robin cleared it but Sam did not. When asked, we got to know that there was one difference in their preparation strategy: "test series". Robin had joined a test series, and he used to test his knowledge and understanding by taking those exams and then evaluating where he was lagging. Sam, on the other hand, was confident and just kept training himself.

In the same fashion as discussed above, a machine learning model can be trained extensively with many parameters and new techniques, but as long as you skip its evaluation, you cannot trust it.

How to read a Confusion Matrix?

A confusion matrix is a tabulation of a model's predictions against the actual class labels of the data points.

Let's say you are building a model that detects whether a person has diabetes or not. After a train-test split you got a test set of length 100, out of which 70 data points are labelled positive (1) and 30 data points are labelled negative (0). Now let me draw the matrix for your test prediction:

Out of 70 actual positive data points, your model predicted 64 points as positive and 6 as negative. Out of 30 actual negative points, it predicted 3 as positive and 27 as negative.
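As a quick illustration, here is a minimal sketch (assuming scikit-learn and NumPy are available) that reproduces this confusion matrix from hypothetical label arrays with exactly these counts:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical test labels matching the example: 70 positives, 30 negatives
y_true = np.array([1] * 70 + [0] * 30)

# Hypothetical predictions: 64 of the positives predicted as 1, 6 as 0;
# 27 of the negatives predicted as 0, 3 as 1
y_pred = np.array([1] * 64 + [0] * 6 + [0] * 27 + [1] * 3)

# With labels=[1, 0], the first row is the actual positives, the second the actual negatives
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
# [[64  6]   -> TP, FN
#  [ 3 27]]  -> FP, TN
```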

Note: In the notations True Positive, True Negative, False Positive & False Negative, notice that the second term (Positive or Negative) denotes your prediction, and the first term (True or False) denotes whether your prediction was right or wrong.

Based on the above matrix we can define some very important ratios:

  • TPR (True Positive Rate) = True Positive / Actual Positive
  • TNR (True Negative Rate) = True Negative / Actual Negative
  • FPR (False Positive Rate) = False Positive / Actual Negative
  • FNR (False Negative Rate) = False Negative / Actual Positive

For our diabetes detection model, we can calculate these ratios:

TPR = 64/70 = 91.4%

TNR = 27/30 = 90%

FPR = 3/30 = 10%

FNR = 6/70 = 8.6%
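Continuing the sketch above, these ratios fall straight out of the four cells of the matrix:

```python
TP, FN, FP, TN = 64, 6, 3, 27  # from the confusion matrix above

tpr = TP / (TP + FN)  # True Positive Rate  = 64/70 ≈ 0.914
tnr = TN / (TN + FP)  # True Negative Rate  = 27/30 = 0.900
fpr = FP / (FP + TN)  # False Positive Rate =  3/30 = 0.100
fnr = FN / (FN + TP)  # False Negative Rate =  6/70 ≈ 0.086

print(f"TPR={tpr:.1%}, TNR={tnr:.1%}, FPR={fpr:.1%}, FNR={fnr:.1%}")
```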

If you want your model to be smart, then it has to predict correctly. That means your True Positives and True Negatives should be as high as possible, and at the same time you need to minimize your mistakes, so your False Positives and False Negatives should be as low as possible. In terms of ratios, your TPR & TNR should be very high, whereas FPR & FNR should be very low.

A smart model: TPR ↑, TNR ↑, FPR ↓, FNR ↓

A dumb model: any other combination of TPR, TNR, FPR, FNR

One may argue that it is not possible to take care of all four ratios equally because, at the end of the day, no model is perfect. Then what should we do?

Yes, that is true. That is why we build a model keeping the domain in mind. Certain domains demand that we keep a specific ratio as the main priority, even at the cost of other ratios being poor. For example, in cancer diagnosis we cannot miss any positive patient at any cost, so we are supposed to keep TPR at the maximum and FNR close to 0. Even if we predict a healthy patient as diagnosed, it is still okay, as he can go for further check-ups.

Accuracy

Accuracy is what its literal meaning says: a measure of how accurate your model is.

Accuracy = Correct Predictions / Total Predictions

Using the confusion matrix, Accuracy = (TP + TN) / (TP + TN + FP + FN)
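Continuing the confusion-matrix sketch from earlier, accuracy is just the diagonal of the matrix over everything:

```python
TP, FN, FP, TN = 64, 6, 3, 27  # diabetes example above

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.91
```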

Accuracy is one of the simplest performance metrics we can use. But let me warn you: accuracy can sometimes lead you to false illusions about your model, so you should first know your data set and the algorithm used, and only then decide whether to use accuracy or not.

Before going to the failure cases of accuracy, let me introduce you to two types of data sets:

  • Balanced: A data set that contains an almost equal number of entries for all labels/classes. E.g. out of 1000 data points, 600 are positive and 400 are negative.

  • Imbalanced: A data set with a biased distribution of entries towards a particular label/class. E.g. out of 1000 entries, 990 are the positive class and 10 are the negative class.

Very Important: Never use accuracy as a measure when dealing with an imbalanced test set.

Why?

Suppose you have an imbalanced test set of 1000 entries, with 990 (+ve) and 10 (-ve). And somehow you ended up creating a poor model that always predicts "+ve" due to the imbalanced train set. Now when you predict your test set labels, it will always predict "+ve". So out of 1000 test set points, you get 1000 "+ve" predictions. Then your accuracy would come out to:

990/1000 = 99%

Whoa! Amazing! You are happy to see such an awesome accuracy score.

But you should know that your model is really poor, because it always predicts the "+ve" label.
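Here is a minimal sketch of this pitfall (assuming scikit-learn is installed): a constant "+ve" predictor scores 99% accuracy on the imbalanced test set described above while missing every single negative point.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced test set: 990 positives, 10 negatives
y_true = np.array([1] * 990 + [0] * 10)

# A "model" that always predicts the positive class
y_pred = np.ones_like(y_true)

print(accuracy_score(y_true, y_pred))             # 0.99 -- looks great
print(recall_score(y_true, y_pred, pos_label=0))  # 0.0  -- every negative point is missed
```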

Very Important: Also, we cannot compare two models that return probability scores and have the same accuracy.

There are certain models that give the probability of each data point belonging to a particular class, like Logistic Regression. Let us take this case:

[Table 1: probability scores P(Y=1) given by models M1 and M2 for the test points]

As you can see, if P(Y=1) > 0.5 the model predicts class 1. When we calculate accuracy for both M1 and M2, it comes out the same, but looking at the probability scores it is quite evident that M1 is a much better model than M2.

This issue is dealt with beautifully by Log Loss, which I will explain later in the blog.

Precision & Recall

Precision: It is the ratio of True Positives (TP) to the total positive predictions, i.e. Precision = TP / (TP + FP). Basically, it tells you how often your positive predictions were actually positive.

Recall: It is nothing but TPR (the True Positive Rate explained above), i.e. Recall = TP / (TP + FN). It tells you, out of all the actually positive points, how many were predicted positive.

F-Measure: The harmonic mean of precision and recall, F1 = 2 · Precision · Recall / (Precision + Recall).

To understand this, let's see an example: when you ask a query in Google, it returns 40 pages but only 30 are relevant. But your friend, who is an employee at Google, tells you that there were 100 relevant pages in total for that query. So the precision is 30/40 = 3/4 = 75%, while the recall is 30/100 = 30%. So, in this case, precision is "how useful the search results are", and recall is "how complete the results are".
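A small sketch of these three metrics (scikit-learn assumed), reusing the hypothetical diabetes predictions from earlier:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Same hypothetical test set as in the confusion matrix example
y_true = np.array([1] * 70 + [0] * 30)
y_pred = np.array([1] * 64 + [0] * 6 + [0] * 27 + [1] * 3)

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 64/67 ≈ 0.955
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 64/70 ≈ 0.914
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two ≈ 0.934

print(precision, recall, f1)
```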

ROC & AUC

Receiver Operating Characteristic Curve (ROC):

It is a plot of TPR (True Positive Rate) against FPR (False Positive Rate), calculated at multiple threshold values taken from the reverse-sorted list of probability scores given by a model.

[Figure: A typical ROC curve]

Now, how do we plot the ROC curve?

To answer this, let me take you back to Table 1 above. Consider just model M1. You see, for every x value we have a probability score, and in that table we assigned the data points with a score greater than 0.5 to class 1. Now sort all the values in descending order of probability score and, one by one, take threshold values equal to each of the probability scores. That gives threshold values = [0.96, 0.94, 0.92, 0.14, 0.11, 0.08]. For each threshold value, predict the classes and calculate TPR and FPR. You will get 6 pairs of TPR & FPR. Just plot them and you will get the ROC curve.
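In practice this thresholding loop is already implemented for you; here is a minimal sketch with scikit-learn (the probability scores below are the hypothetical M1 scores from Table 1, and the true labels are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and the M1 probability scores from Table 1
y_true = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.96, 0.94, 0.92, 0.14, 0.11, 0.08])

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) pair per threshold
auc = roc_auc_score(y_true, scores)

print(list(zip(fpr, tpr)))
print("AUC =", auc)  # 1.0 here, since the scores rank all positives above all negatives
```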

Note: Since the maximum TPR and FPR value is 1, the area under the curve (AUC) of the ROC lies between 0 and 1.

The area under the diagonal dashed line in a typical ROC plot is 0.5. AUC = 0 means a very poor model, AUC = 1 means a perfect model. As long as your model's AUC score is more than 0.5, your model is making sense, because even a random model scores 0.5 AUC.

Very Important: You can get a very high AUC even for a dumb model generated from an imbalanced data set. So always be careful while dealing with an imbalanced data set.

Note: AUC has nothing to do with the numerical values of the probability scores as long as the order is maintained. The AUC will be the same for all models that give the same ordering of data points when sorted by probability score.
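This order-invariance is easy to verify: squashing the scores through any strictly increasing function leaves the AUC unchanged (labels and scores below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.3, 0.2, 0.1])

# Any strictly increasing transformation preserves the ranking, hence the AUC
print(roc_auc_score(y_true, scores))          # 0.9375
print(roc_auc_score(y_true, scores ** 3))     # same value
print(roc_auc_score(y_true, np.log(scores)))  # same value
```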

Log Loss

This performance metric looks at the probability scores themselves: it penalizes each prediction according to how far its probability score is from the actual class label, and the penalty grows with the deviation.

For each data point in a binary classification, we calculate its log loss using the formula below:

Log loss (binary) = −( y·log(p) + (1 − y)·log(1 − p) )

Here p is the probability of the data point belonging to class 1, and y is the class label (0 or 1).

Suppose p_1 for some x_1 is 0.95 and p_2 for some x_2 is 0.55, and the cut-off probability for qualifying as class 1 is 0.5. Then both qualify as class 1, but the log loss of p_2 will be much higher than the log loss of p_1.
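A quick sketch of this comparison in plain Python, using the numbers from the example above:

```python
import math

def binary_log_loss(y, p):
    """Log loss of a single prediction p for true label y (0 or 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Both points truly belong to class 1 and both are predicted as class 1 (p > 0.5),
# but the less confident prediction is penalized far more heavily.
print(binary_log_loss(1, 0.95))  # ≈ 0.051
print(binary_log_loss(1, 0.55))  # ≈ 0.598
```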

The range of log loss is [0, infinity): it is 0 for a perfectly confident correct prediction and grows without bound as the predicted probability moves away from the true label.

For each data point in a multi-class classification, we calculate its log loss using the formula below:

Log loss (multi-class) = −Σ_c y(o,c)·log( p(o,c) ), for a single observation o, summed over all classes c

Here y(o,c) = 1 if observation o belongs to class c (and 0 otherwise), and p(o,c) is the predicted probability that observation o belongs to class c. The rest of the concept is the same.
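With scikit-learn, this per-point formula is averaged over a whole test set for you; a minimal sketch (the labels and probability rows are made up):

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical 3-class problem: true labels and one predicted probability row per point
y_true = [0, 2, 1, 2]
y_prob = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.20, 0.70],
    [0.25, 0.60, 0.15],
    [0.30, 0.30, 0.40],
])

print(log_loss(y_true, y_prob, labels=[0, 1, 2]))  # mean multi-class log loss ≈ 0.50
```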

Coefficient of Determination (R-Squared)

It is denoted by R². While predicting the target values of the test set, we encounter errors (e_i), which are the differences between the predicted and actual values.

Let's say we have a test set with n entries. As we know, every data point has a target value, say [y1, y2, y3, ..., yn]. Let the predicted values on the test data be [f1, f2, f3, ..., fn].

Calculate the Residual Sum of Squares, which is the sum of all the squared errors (e_i), using this formula, where fi is the target value predicted by the model for the i-th data point:

SS_R = Σ (yi − fi)², summed over i = 1 to n

Take the mean of all the actual target values:

y_bar = (y1 + y2 + ... + yn) / n

Then calculate the Total Sum of Squares, which is proportional to the variance of the test set target values:

SS_T = Σ (yi − y_bar)², summed over i = 1 to n

If you observe both sum-of-squares formulas, you can see that the only difference is the second term, i.e. y_bar versus fi. The total sum of squares gives us the intuition that it is the same as the residual sum of squares, just with the predicted values being [y_bar, y_bar, ..., y_bar] (n times). Yes, your intuition is right: this corresponds to a very simple "mean model" that always predicts the average of the target values, irrespective of the input data.

Now we formulate R² as:

R² = 1 − (SS_R / SS_T)

As you can see, R² is a metric that compares your model with that very simple mean model, which returns the average of the target values every time irrespective of the input data. The comparison has 4 cases:

case 1: SS_R = 0

(R² = 1) A perfect model with no errors at all.

case 2: SS_R > SS_T

(R² < 0) The model is even worse than the simple mean model.

case 3: SS_R = SS_T

(R² = 0) The model is the same as the simple mean model.

case 4: SS_R < SS_T

(0 < R² < 1) The model is okay.

Summary

So, in a nutshell, you should know your data set and your problem very well; then you can always create a confusion matrix, check its accuracy, precision and recall, plot the ROC curve, and find the AUC as per your needs. But if your data set is imbalanced, never use accuracy as a measure. If you want to evaluate your model even more deeply, so that the probability scores are also given weight, then go for Log Loss.

Remember, always evaluate your training!

Source: https://medium.com/swlh/how-to-evaluate-the-performance-of-your-machine-learning-model-40769784d654
