Guide to Classification on Imbalanced Datasets

發(fā)布時間:2023/12/14 编程问答 27 豆豆
生活随笔 收集整理的這篇文章主要介紹了 数据安全分类分级实施指南_不平衡数据集分类指南 小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.



Balance within the imbalance to balance what’s imbalanced — Amadou Jarou Bah


Disclaimer: This is a comprehensive tutorial on handling imbalanced datasets. Whilst these approaches remain valid for multiclass classification, the main focus of this article will be on binary classification for simplicity.


Introduction

As any seasoned data scientist or statistician will be aware of, datasets are rarely distributed evenly across attributes of interest. Let’s imagine we are tasked with discovering fraudulent credit card transactions — naturally, the vast majority of these transactions will be legitimate, and only a very small proportion will be fraudulent. Similarly, if we are testing individuals for cancer, or for the presence of a virus (COVID-19 included), the positive rate will (hopefully) be only a small fraction of those tested. More examples include:


  • An e-commerce company predicting which users will buy items on their platform
  • A manufacturing company analyzing produced materials for defects
  • Spam email filtering trying to differentiate ‘ham’ from ‘spam’
  • Intrusion detection systems examining network traffic for malware signatures or atypical port activity
  • Companies predicting churn rates amongst their customers
  • Number of clients who close a specific account in a bank or financial organization
  • Prediction of telecommunications equipment failures
  • Detection of oil spills from satellite images
  • Insurance risk modeling
  • Hardware fault detection

One usually has far fewer data points from the adverse class. This is unfortunate, as we care a lot about avoiding misclassifying elements of this class.

In actual fact, it is pretty rare to have perfectly balanced data in classification tasks. Oftentimes the items we are interested in analyzing are inherently ‘rare’ events, which for that very reason are difficult to predict. This presents a curious problem for aspiring data scientists, since many data science programs do not properly address how to handle imbalanced datasets despite their prevalence in industry.

When does a dataset become ‘imbalanced’?

The notion of an imbalanced dataset is a somewhat vague one. Generally, a binary classification dataset with a 49–51 split between the two classes would not be considered imbalanced. However, if we have a dataset with a 90–10 split, it seems obvious to us that this is an imbalanced dataset. Clearly, the boundary for imbalanced data lies somewhere between these two extremes.

In some sense, the term ‘imbalanced’ is a subjective one and it is left to the discretion of the data scientist. In general, a dataset is considered to be imbalanced when standard classification algorithms — which are inherently biased toward the majority class (further details in a previous article) — return suboptimal solutions. A data scientist may look at a 45–55 split dataset and judge that this is close enough that measures do not need to be taken to correct for the imbalance. However, the more imbalanced the dataset becomes, the greater the need is to correct for this imbalance.
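As a quick sanity check before deciding whether corrective measures are needed, one can compute the ratio between the majority and minority class counts. A minimal sketch (the `imbalance_ratio` helper is illustrative, not a standard library function):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the majority-class count to the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

y = [0] * 90 + [1] * 10      # a 90-10 split
ratio = imbalance_ratio(y)   # 9.0: well past the point where correction helps
```

A ratio near 1 suggests the split is close enough to balanced to leave alone; the larger it grows, the stronger the case for the corrections discussed below.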

In a concept-learning problem, the data set is said to present a class imbalance if it contains many more examples of one class than the other.


As a result, these classifiers tend to ignore small classes while concentrating on classifying the large ones accurately.


Imagine you are working for Netflix and are tasked with predicting which customers will churn (a customer ‘churning’ means they will stop using your services or products).

In an ideal world (at least for the data scientist), our training and testing datasets would be close to fully balanced, with around 50% of the dataset containing individuals that will churn and 50% who will not. In this case, a 90% accuracy will more or less indicate a 90% accuracy on both the positively and negatively classed groups. Our errors will be evenly split across both groups. In addition, we have roughly the same number of points in both classes, which, by the law of large numbers, reduces the overall variance within each class. This is great for us: accuracy is an informative metric in this situation and we can continue with our analysis unimpeded.

A dataset with an even 50–50 split across the binary response variable. There is no majority class in this example.

As you may have suspected, most people that already pay for Netflix don't have a 50% chance of stopping their subscription every month. In fact, the percentage of people that will churn is rather small, closer to a 90–10 split. How does the presence of this dataset imbalance complicate matters?


Assuming a 90–10 split, we now have a very different data story to tell. Giving this data to an algorithm without any further consideration will likely result in an accuracy close to 90%. This seems pretty good, right? It’s about the same as what we got previously. If you try putting this model into production your boss will probably not be so happy.


An imbalanced dataset with a 90–10 split. False positives will be much larger than false negatives. Variance in the minority set will be larger due to fewer data points. The majority class will dominate algorithmic predictions without any correction for imbalance.

Given the prevalence of the majority class (the 90% class), our algorithm will likely regress to a prediction of the majority class. The algorithm can pretty closely maximize its accuracy (our scoring metric of choice) by arbitrarily predicting that the majority class occurs every time. This is a trivial result and provides close to zero predictive power.

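This degenerate behaviour is easy to demonstrate: a classifier that always predicts the majority class scores high accuracy while having zero recall on the minority class. A small NumPy sketch with synthetic labels:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.choice([0, 1], size=1000, p=[0.9, 0.1])  # ~90-10 split

y_pred = np.zeros_like(y_true)            # always predict the majority class
accuracy = (y_pred == y_true).mean()      # close to 0.9, yet zero skill
minority_recall = (y_pred[y_true == 1] == 1).mean()   # exactly 0.0
```

The headline accuracy looks respectable, but every single minority-class event is missed.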

(Left) A balanced dataset with the same number of items in the positive and negative class; the number of false positives and false negatives in this scenario are roughly equivalent and result in little classification bias. (Right) An imbalanced dataset with around 5% of samples being in the negative class and 95% of samples being in the positive class (this could be the number of people that pay for Netflix that decide to quit during the next payment cycle).

Predictive accuracy, a popular choice for evaluating the performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly.


Visually, this dataset might look something like this:


Machine learning algorithms by default assume that data is balanced. In classification, this corresponds to a comparable number of instances of each class. Classifiers learn better from a balanced distribution. It is up to the data scientist to correct for imbalances, which can be done in multiple ways.

Different Types of Imbalance

We have shown that imbalanced datasets pose additional challenges beyond those of standard datasets. To further complicate matters, there are different types of imbalance that can occur in a dataset.

(1) Between-Class

A between-class imbalance occurs when there is an imbalance in the number of data points contained within each class. An example of this is shown below:


An illustration of between-class imbalance. We have a large number of data points for the red class but relatively few for the white class.

An example of this would be a mammography dataset, which uses images known as mammograms to predict breast cancer. Consider the number of mammograms related to positive and negative cancer diagnoses:


The vast majority of samples (>90%) are negative, whilst relatively few (<10%) are positive.

Note that given enough data samples in both classes the accuracy will improve as the sampling distribution is more representative of the data distribution, but by virtue of the law of large numbers, the majority class will have inherently better representation than the minority class.


(2) Within-Class

A within-class imbalance occurs when the dataset has balanced between-class data but one of the classes is poorly represented in some regions. An example of this is shown below:

An illustration of within-class imbalance. We have a large number of data points for both classes but the number of data points in the white class in the top left corner is very sparse, which can result in similar complications as between-class imbalance for predictions in those regions.

(3) Intrinsic and Extrinsic

An intrinsic imbalance is due to the nature of the dataset, while extrinsic imbalance is related to time, storage, and other factors that limit the dataset or the data analysis. Intrinsic imbalance is relatively simple and is what we most commonly see, but extrinsic imbalance can exist separately and can also act to increase the imbalance of a dataset.

For example, companies often use intrusion detection systems that analyze packets of data sent in and out of networks in order to detect malware or malicious activity. Depending on whether you analyze all data or just data sent through specific ports or specific devices, this will significantly influence the imbalance of the dataset (most network traffic is likely legitimate). Similarly, if log files or data packets related to suspected malicious behavior are routinely stored but normal logs are not (or only a select few types are stored), then this can also influence the imbalance of the dataset. Likewise, if logs were only stored during a normal working day (say, 9 AM–5 PM) instead of 24 hours, this will also affect the imbalance.

Further Complications of Imbalance

There are a couple more difficulties introduced by imbalanced datasets. Firstly, we have class overlapping. This is not always a problem, but it can often arise in imbalanced learning problems and cause headaches. Class overlapping is illustrated in the dataset below.

Example of class overlapping. Some of the positive data points (stars) are intermixed with the negative data points (circles), which would lead an algorithm to construct an imperfect decision boundary.

Class overlapping occurs in normal classification problems, so what is the additional issue here? Well, the class more represented in overlap regions tends to be better classified by methods based on global learning (on the full dataset). This is because the algorithm is able to get a more informed picture of the data distribution of the majority class.


In contrast, the class less represented in such regions tends to be better classified by local methods. If we take k-NN as an example, as the value of k increases, it becomes increasingly global and less local. It can be shown that low values of k give better performance on the minority class, with performance dropping at high values of k. This shift in accuracy is not exhibited by the majority class because it is well-represented at all points.

This suggests that local methods may be better suited for studying the minority class. One method to correct for this is the CBO Method. The CBO Method uses cluster-based resampling to identify ‘rare’ cases and resample them individually, so as to avoid the creation of small disjuncts in the learned hypothesis. This is a method of oversampling — a topic that we will discuss in detail in the following section.


CBO Method. Once the training examples of each class have been clustered, oversampling starts. In the majority class, all the clusters, except for the largest one, are randomly oversampled so as to get the same number of training examples as the largest cluster.

Correcting Dataset Imbalance

There are several techniques to correct for dataset imbalance, falling into two main families: sampling methods and cost-sensitive methods.

The simplest and most commonly used of these are sampling methods called oversampling and undersampling, which we will go into more detail on.


Oversampling/Undersampling


Simply stated, oversampling involves generating new data points for the minority class, and undersampling involves removing data points from the majority class. This acts to somewhat reduce the extent of the imbalance in the dataset.


What does undersampling look like? We continually remove like-samples in close proximity until both classes have the same number of data points.

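A bare-bones version of random undersampling can be written in a few lines of NumPy. This is only a sketch; libraries such as imbalanced-learn provide a more complete `RandomUnderSampler`:

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Randomly drop majority-class rows until every class matches the
    minority-class count. A minimal sketch, not a production tool."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)
X_res, y_res = random_undersample(X, y)   # 10 samples left in each class
```

Note that this sketch discards points uniformly at random; the 'like-samples in close proximity' heuristic described above requires a distance-aware variant.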

Undersampling. Imagine you are analysing a dataset for fraudulent transactions. Most of the transactions are not fraudulent, creating a fundamentally imbalanced dataset. In the scenario of undersampling, we will take fewer samples from the majority class to help reduce the extent of this imbalance.

Is undersampling a good idea? Undersampling is recommended by many statistical researchers, but it is only a good idea if enough data points are available in the majority class. Also, since the majority class will end up with the same number of points as the minority class, the statistical properties of its distribution will become ‘looser’ in a sense. However, with this method we have not artificially distorted the data distribution by adding in artificial data points.

Illustration of undersampling. Like-samples in close proximity are removed in an attempt to increase the sparsity of the data distribution.

What does oversampling look like? In short, the opposite of undersampling. We are artificially adding data points to our dataset to make the number of instances in each class balanced.

Oversampling. In the scenario of oversampling, we will oversample from the minority class to help reduce the extent of this imbalance.

How do we generate these samples? The most common way is to generate points that are close in dataspace proximity to existing samples or are ‘between’ two samples, as illustrated below.

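The simplest variant, random oversampling with replacement, can be sketched as follows (illustrative only; generating points ‘between’ samples is what SMOTE does, discussed shortly):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate random minority-class rows until every class matches the
    majority-class count. A sketch, not a production implementation."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_target = counts.max()
    idx = [np.arange(len(y))]            # keep every original row
    for c, n in zip(classes, counts):
        if n < n_target:
            idx.append(rng.choice(np.flatnonzero(y == c),
                                  size=n_target - n, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)
X_res, y_res = random_oversample(X, y)   # 90 samples in each class
```

Because the new rows are exact duplicates, this variant is particularly prone to the overfitting risk described below.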

Illustration of oversampling.

As you may have suspected, there are some downsides to adding false data points. Firstly, you risk overfitting, especially if new points are generated around samples that are actually noise — you end up exacerbating this noise by adding reinforcing measurements. In addition, adding these values randomly can also contribute additional noise to our model.

SMOTE (Synthetic Minority Oversampling Technique)

Luckily for us, we don’t have to write an algorithm for randomly generating data points for the purpose of oversampling. Instead, we can use the SMOTE algorithm.


How does SMOTE work? SMOTE generates new samples in between existing data points based on their local density and their borders with the other class. Not only does it perform oversampling, but can subsequently use cleaning techniques (undersampling, more on this shortly) to remove redundancy in the end. Below is an illustration for how SMOTE works when studying class data.


An illustration of how SMOTE functions. The instance on the left is isolated and is thus considered noise by the algorithm. No additional data points are generated in its proximity, or, if they are, they will be in very close proximity to the singular point. The two clusters in the center and right have several data points, indicating that it is less likely that these points correspond to random noise. Thus, a larger cluster (empirical data distribution) can be drawn by the algorithm from which additional samples can be generated.

The algorithm for SMOTE is as follows. For each minority sample:


– Find its k nearest minority-class neighbours

– Randomly select j of these neighbours

– Randomly generate synthetic samples along the lines joining the minority sample and its j selected neighbours (j depends on the amount of oversampling desired)
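These steps can be sketched directly in NumPy. This is a toy implementation for intuition only (it samples one seed point per synthetic sample rather than looping over every minority sample); in practice one would use imbalanced-learn's `SMOTE`:

```python
import numpy as np

def smote_sketch(X_min, k=3, n_new=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    minority points and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                        # position along the joining line
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

X_min = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [0.5, 0.5]])
X_new = smote_sketch(X_min)   # 5 new points inside the minority region
```

Every synthetic point lies on a line segment between two real minority points, so the new samples stay inside the minority region rather than being exact duplicates.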

Informed vs. Random Oversampling


Using random oversampling (with replacement) of the minority class has the effect of making the decision region for the minority class very specific. In a decision tree, it would cause a new split and often lead to overfitting. SMOTE’s informed oversampling generalizes the decision region for the minority class. As a result, larger and less specific regions are learned, thus, paying attention to minority class samples without causing overfitting.


Drawbacks of SMOTE


Overgeneralization. SMOTE’s procedure can be dangerous since it blindly generalizes the minority area without regard to the majority class. This strategy is particularly problematic in the case of highly skewed class distributions since, in such cases, the minority class is very sparse with respect to the majority class, thus resulting in a greater chance of class mixture.


Inflexibility. The number of synthetic samples generated by SMOTE is fixed in advance, thus not allowing for any flexibility in the re-balancing rate.


Another potential issue is that SMOTE might introduce the artificial minority class examples too deeply in the majority class space. This drawback can be resolved by hybridization: combining SMOTE with undersampling algorithms. One of the most famous of these is Tomek Links. Tomek Links are pairs of instances of opposite classes who are their own nearest neighbors. In other words, they are pairs of opposing instances that are very close together.


Tomek’s algorithm looks for such pairs and removes the majority instance of the pair. The idea is to clarify the border between the minority and majority classes, making the minority region(s) more distinct. Scikit-learn has no built-in modules for doing this, though there are some independent packages (e.g., TomekLink, imbalanced-learn).

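A from-scratch illustration of the idea follows (imbalanced-learn's `TomekLinks` is the practical choice; this sketch builds a full distance matrix, so it only suits small datasets):

```python
import numpy as np

def tomek_majority_mask(X, y):
    """Flag majority-class points that form Tomek links, i.e. mutual
    nearest neighbours of opposite classes, for removal. A sketch."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                 # each point's nearest neighbour
    majority = np.bincount(y).argmax()
    drop = np.zeros(len(y), dtype=bool)
    for i in range(len(y)):
        j = nn[i]
        if nn[j] == i and y[i] != y[j] and y[i] == majority:
            drop[i] = True
    return drop

X = np.array([[0.0], [0.1], [5.0], [5.1], [10.0]])
y = np.array([0, 1, 0, 0, 0])             # class 0 is the majority
drop = tomek_majority_mask(X, y)          # only the point at 0.0 is flagged
```

Here the points at 0.0 and 0.1 are mutual nearest neighbours of opposite classes, so the majority member of that pair is marked for removal, sharpening the class border.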

Thus, Tomek’s algorithm is an undersampling technique that acts as a data cleaning method for SMOTE to regulate against redundancy. As you may have suspected, there are many additional undersampling techniques that can be combined with SMOTE to perform the same function. A comprehensive list of these functions can be found in the functions section of the imbalanced-learn documentation.


An additional example is Edited Nearest Neighbours (ENN). ENN removes any example whose class label differs from the class of at least two of its neighbours. ENN removes more examples than Tomek links do and can also remove examples from both classes.

Other more nuanced versions of SMOTE include Borderline SMOTE, SVMSMOTE, and KMeansSMOTE, and more nuanced versions of the undersampling techniques applied in concert with SMOTE are Condensed Nearest Neighbor (CNN), Repeated Edited Nearest Neighbor, and Instance Hardness Threshold.


Cost-Sensitive Learning

We have discussed sampling techniques and are now ready to discuss cost-sensitive learning. In many ways, the two approaches are analogous — the main difference being that in cost-sensitive learning we perform under- and over-sampling by altering the relative weighting of individual samples.


Upweighting. Upweighting is analogous to oversampling and works by increasing the weight of one of the classes while keeping the weight of the other class at one.

Down-weighting. Down-weighting is analogous to undersampling and works by decreasing the weight of one of the classes while keeping the weight of the other class at one.

An example of how this can be done with sklearn is via the compute_class_weight function in the sklearn.utils.class_weight module; the resulting weights can be applied to any sklearn classifier (and used within Keras):

import numpy as np
from sklearn.utils import class_weight

class_weights = class_weight.compute_class_weight(
    class_weight='balanced', classes=np.unique(y_train), y=y_train)
# Keras expects a {class_index: weight} mapping rather than an array
model.fit(X_train, y_train, class_weight=dict(enumerate(class_weights)))

In this case, we have set the class weights to ‘balanced’, meaning each class is weighted inversely proportionally to its relative number of points — this is what I would recommend unless you have a good reason for setting the values yourself. If you have three classes and want to weight one of them 10x higher and another 20x higher (because there are 10x and 20x fewer of these points in the dataset than the majority class), then we can write this as:

class_weight = {0: 1.,
                1: 10.,
                2: 20.}
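As a concrete end-to-end sketch of how such weights feed into a model, here is a toy example with scikit-learn's LogisticRegression (the synthetic data and the minority-class shift are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import class_weight

rng = np.random.default_rng(0)
y_train = np.array([0] * 90 + [1] * 10)
X_train = rng.normal(size=(100, 2))
X_train[y_train == 1] += 2.0              # shift the minority class

# 'balanced' computes n_samples / (n_classes * class_count) for each class
weights = class_weight.compute_class_weight(
    class_weight='balanced', classes=np.unique(y_train), y=y_train)
clf = LogisticRegression(class_weight=dict(zip([0, 1], weights)))
clf.fit(X_train, y_train)
```

With a 90–10 split, the ‘balanced’ heuristic gives the minority class a weight of 100/20 = 5.0 versus 100/180 ≈ 0.56 for the majority, so minority mistakes cost roughly nine times more during training.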

Some authors claim that cost-sensitive learning is slightly more effective than random or directed over- or under-sampling, although all approaches are helpful, and directed oversampling is close to cost-sensitive learning in efficacy. Personally, when I am working on a machine learning problem I will use cost-sensitive learning because it is much simpler to implement and communicate to individuals. However, there may be additional aspects of using sampling techniques that provide superior results of which I am not aware.

Assessment Metrics

In this section, I outline several metrics that can be used to analyze the performance of a classifier trained to solve a binary classification problem. These include (1) the confusion matrix, (2) binary classification metrics, (3) the receiver operating characteristic curve, and (4) the precision-recall curve.


Confusion Matrix

Despite what you may have garnered from its name, a confusion matrix is not all that confusing. A confusion matrix is the most basic form of assessment of a binary classifier. Given the prediction outputs of our classifier and the true response variable, a confusion matrix tells us how many of our predictions are correct for each class, and how many are incorrect. The confusion matrix provides a simple visualization of the performance of a classifier based on these factors.

Here is an example of a confusion matrix:


Hopefully what this is showing is relatively clear. The TP cell tells us the number of true positives: the number of positive samples that we predicted to be positive.

The TN cell tells us the number of true negatives: the number of negative samples that we predicted to be negative.

The FP cell tells us the number of false positives: the number of negative samples that we incorrectly predicted to be positive.

The FN cell tells us the number of false negatives: the number of positive samples that we incorrectly predicted to be negative.

These numbers are very important as they form the basis of the binary classification metrics discussed next.

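In scikit-learn the four counts can be read straight off `confusion_matrix`; note sklearn's row/column ordering, shown in the comment (the toy labels here are illustrative):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# sklearn lays the matrix out as [[TN, FP], [FN, TP]], with label 1 positive
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```

For these labels the unravelled counts are TN=4, FP=1, FN=1, TP=2.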

Binary Classification Metrics

There are a plethora of single-value metrics for binary classification. As such, only a few of the most commonly used ones and their different formulations are presented here; more details on scoring metrics, and on their relation to confusion matrices and ROC curves (discussed in the next section), can be found in the sklearn documentation.

Arguably the five most important metrics for binary classification are: (1) precision, (2) recall, (3) F1 score, (4) accuracy, and (5) specificity.

Precision. Precision provides us with the answer to the question “Of all my positive predictions, what proportion of them are correct?”. If you have an algorithm that predicts all of the positive class correctly but also has a large portion of false positives, the precision will be small. It makes sense why this is called precision since it is a measure of how ‘precise’ our predictions are.


Recall. Recall provides us with the answer to a different question “Of all of the positive samples, what proportion did I predict correctly?”. Instead of false positives, we are now interested in false negatives. These are items that our algorithm missed, and are often the most egregious errors (e.g. failing to diagnose something with cancer that actually has cancer, failing to discover malware when it is present, or failing to spot a defective item). The name ‘recall’ also makes sense for this circumstance as we are seeing how many of the samples the algorithm was able to pick up on.

It should be clear that these questions, whilst related, are substantially different from each other. It is possible to have a very high precision and simultaneously a low recall, and vice versa. For example, if you predicted the majority class every time, you would have 100% recall on the majority class, but every minority-class sample would then become a false positive, dragging your precision down.

One other important point to make is that precision and recall can be determined for each individual class. That is, we can talk about the precision of class A, or the precision of class B, and they will have different values — when doing this, we assume that the class we are interested in is the positive class, regardless of its numeric value.
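To make the per-class distinction concrete, here is a minimal sketch using scikit-learn, where the pos_label argument selects which class is treated as positive. The labels and predictions are invented purely for illustration:

```python
from sklearn.metrics import precision_score, recall_score

# Toy imbalanced data: 7 negatives (class 0) and 3 positives (class 1)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Treat class 1 as the positive class
prec_1 = precision_score(y_true, y_pred, pos_label=1)  # 2 TP / (2 TP + 2 FP) = 0.5
rec_1 = recall_score(y_true, y_pred, pos_label=1)      # 2 TP / (2 TP + 1 FN) = 2/3

# Treat class 0 as the positive class: the same predictions give different values
prec_0 = precision_score(y_true, y_pred, pos_label=0)  # 5 / 6
rec_0 = recall_score(y_true, y_pred, pos_label=0)      # 5 / 7
```

Swapping pos_label changes every number, which is exactly why it matters to state which class a reported precision or recall refers to.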


F1 Score. The F1 score is a single-value metric that combines precision and recall by using the harmonic mean (a fancy type of averaging). Its generalization, the Fβ score, introduces a strictly positive β parameter that describes the relative importance of recall to precision. A larger β value puts a higher emphasis on recall than precision, whilst a smaller value puts less emphasis. If β is 1, precision and recall are treated with equal weighting, recovering the F1 score.
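The two scores can be written as F1 = 2PR / (P + R) and, more generally, Fβ = (1 + β²)PR / (β²P + R). A small sketch, with illustrative labels chosen so that precision is 0.5 and recall is 2/3, checks the hand computation against scikit-learn:

```python
from sklearn.metrics import f1_score, fbeta_score

# Illustrative labels giving precision = 0.5 and recall = 2/3 for class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

p, r = 0.5, 2 / 3
f1_by_hand = 2 * p * r / (p + r)                     # beta = 1: 4/7
fbeta_by_hand = (1 + 2**2) * p * r / (2**2 * p + r)  # beta = 2 favours recall: 5/8

f1_sklearn = f1_score(y_true, y_pred)
fbeta_sklearn = fbeta_score(y_true, y_pred, beta=2)
```

With β = 2 the score rises from roughly 0.571 to 0.625, because the recall (2/3) is higher than the precision (0.5) and is being weighted more heavily.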

What does a high F1 score mean? It suggests that both the precision and recall have high values — this is good and is what you would hope to see upon generating a well-functioning classification model on an imbalanced dataset. A low value indicates that either precision or recall is low, and maybe a call for concern. Good F1 scores are generally lower than good accuracies (in many situations, an F1 score of 0.5 would be considered pretty good, such as predicting breast cancer from mammograms).

Specificity. Simply stated, specificity is the recall of negative values. It answers the question “Of all of the truly negative samples, what proportion did I predict correctly?”. This may be important in situations where examining the relative proportion of false positives is necessary.
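scikit-learn has no dedicated specificity function, but it falls straight out of the confusion matrix. A minimal sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 6 actual negatives and 4 actual positives
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels [0, 1], ravel() unpacks the matrix as (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)  # recall of the negative class: 4/6
sensitivity = tp / (tp + fn)  # recall of the positive class: 3/4
```

Equivalently, recall_score(y_true, y_pred, pos_label=0) returns the same specificity value.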

Macro, Micro, and Weighted Scores

This is where things get a little complicated. Anyone who has delved into these metrics on sklearn may have noticed that we can refer to the recall-macro or f1-weighted score.

A macro-F1 score is the average of F1 scores across each class.

This is most useful if we have many classes and we are interested in the average F1 score for each class. If you only care about the F1 score for one class, you probably won’t need a macro-F1 score.

A micro-F1 score takes all of the true positives, false positives, and false negatives from all the classes and calculates the F1 score.

The micro-F1 score is pretty similar in utility to the macro-F1 score, as it gives an aggregate performance of a classifier over multiple classes. That being said, they will give different results, and understanding the underlying cause of that difference may be informative for a given application.

A weighted-F1 score is the same as the macro-F1 score, but each of the class-specific F1 scores is scaled by the relative number of samples from that class.

In this case, N refers to the proportion of samples in the dataset belonging to a single class. For class A, where class A is the majority class, this might be equal to 0.8 (80%). The values for B and C might be 0.15 and 0.05, respectively.

For a highly imbalanced dataset, a large weighted-F1 score might be somewhat misleading because it is overly influenced by the majority class.
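The three averaging schemes are all exposed through the average argument of scikit-learn's scorers. The toy data below (invented for illustration) is deliberately imbalanced so that the values disagree:

```python
from sklearn.metrics import f1_score

# Imbalanced toy data: 8 samples of class 0, 2 of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

per_class = f1_score(y_true, y_pred, average=None)       # [0.875, 0.5]
macro = f1_score(y_true, y_pred, average='macro')        # (0.875 + 0.5) / 2 = 0.6875
weighted = f1_score(y_true, y_pred, average='weighted')  # 0.8 * 0.875 + 0.2 * 0.5 = 0.8
micro = f1_score(y_true, y_pred, average='micro')        # pooled TP/FP/FN across classes
```

The weighted score (0.8) sits well above the macro score (0.6875) because the majority class dominates it, which is precisely the misleading effect described above. Note also that for single-label problems the micro-F1 equals plain accuracy.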

Other Metrics

Some other metrics you may encounter that can be informative for binary classification (and, to some extent, multiclass classification) are:

Accuracy. If you are reading this, I would imagine you are already familiar with accuracy, but perhaps not so familiar with the others. Cast in terms of the confusion matrix, accuracy is the ratio of correct predictions (true positives and true negatives) to the total number of positive and negative samples.

G-Mean. A less common metric that is somewhat analogous to the F1 score is the G-Mean. This is often cast in two different formulations, the first being the precision-recall g-mean, and the second being the sensitivity-specificity g-mean. They can be used in a similar manner to the F1 score in terms of analyzing algorithmic performance. The precision-recall g-mean can also be referred to as the Fowlkes-Mallows Index.
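There is no dedicated g-mean scorer in scikit-learn itself, so a sketch simply composes the two formulations from the underlying scores (the labels here are invented for illustration):

```python
from math import sqrt

from sklearn.metrics import precision_score, recall_score

# Illustrative labels: 6 actual negatives and 4 actual positives
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

precision = precision_score(y_true, y_pred)              # 3/5
sensitivity = recall_score(y_true, y_pred)               # 3/4
specificity = recall_score(y_true, y_pred, pos_label=0)  # 4/6

g_mean_pr = sqrt(precision * sensitivity)    # precision-recall g-mean (Fowlkes-Mallows)
g_mean_ss = sqrt(sensitivity * specificity)  # sensitivity-specificity g-mean
```

The companion imbalanced-learn package also provides a ready-made geometric_mean_score helper, if you prefer not to compose it yourself.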

There are many other metrics that can be used, but most have specialized use cases and offer little additional utility over the metrics described here. Other metrics the reader may be interested in viewing are balanced accuracy, Matthews correlation coefficient, markedness, and informedness.

Receiver Operating Characteristic (ROC) Curve

An ROC curve is a two-dimensional graph that depicts the trade-off between benefits (true positives) and costs (false positives). It displays the relation between sensitivity and specificity for a given classifier (a binary, parameterized, or score-based classifier).

Here is an example of an ROC curve.

There is a lot to unpack here. Firstly, the dotted line through the center corresponds to a classifier that acts as a ‘coin flip’. That is, it is correct roughly 50% of the time: we are just guessing, which is no better than chance. This acts as our baseline, against which we can compare all other classifiers; these classifiers should sit closer to the top left corner of the plot, since we want high true positive rates in all cases.

It should be noted that an ROC curve does not assess a group of classifiers. Rather, it examines a single classifier over a set of classification thresholds.

What does this mean? It means that for one point, I take my classifier and set the threshold to be 0.3 (30% propensity) and then assess the true positive and false positive rates.

True Positive Rate: the ratio of true positives to the sum of true positives and false negatives, generated by the combination of a specific classifier and classification threshold.

False Positive Rate: the ratio of false positives to the sum of false positives and true negatives, generated by the combination of a specific classifier and classification threshold.

This gives me two numbers, which I can then plot on the curve. I then take another threshold, say 0.4, and repeat this process. After doing this for every threshold of interest (perhaps in 0.1, 0.01, or 0.001 increments), we have constructed an ROC curve for this classifier.
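This whole sweep is exactly what scikit-learn's roc_curve performs in one call: it takes the ground truth and the classifier's scores, and returns one (FPR, TPR) pair per threshold. The scores below are hypothetical stand-ins for a real classifier's predicted probabilities:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground truth and predicted scores (propensities)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.5, 0.9, 0.45]

# One (fpr, tpr) point per threshold; plotting tpr against fpr gives the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Single-number summary of the same curve (the AUC, discussed below)
auc = roc_auc_score(y_true, scores)  # 22 of 24 positive/negative pairs ranked correctly
```

Plotting tpr against fpr (e.g. with plt.plot(fpr, tpr)) reproduces the kind of curve shown in the figure above.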

An example ROC curve showing how an individual point is plotted. A classifier is selected along with a classification threshold. Following this, the true positive rate and false positive rate for this combination of classifier and threshold are calculated and subsequently plotted.

What is the point of doing this? Depending on your application, you may be very averse to false positives, as they may be very costly (e.g. launches of nuclear missiles), and thus would like a classifier that has a very low false-positive rate. Conversely, you may not care so much about having a high false positive rate as long as you get a high true positive rate (stopping most fraud events may be worth it even if you then have to manually check many more occurrences that the algorithm flags). For the optimal balance between these two rates (where false positives and false negatives are equally costly), we would take the classification threshold whose ROC point has the minimum distance from the top left corner.
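When the two error types are equally costly, the threshold closest to the top left corner can be picked mechanically. A sketch with hypothetical scores (in practice these would be your classifier's predicted probabilities):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical ground truth and predicted scores
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.5, 0.9, 0.45]

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Euclidean distance of each ROC point from the ideal corner (fpr=0, tpr=1)
dist = np.sqrt(fpr**2 + (1 - tpr) ** 2)
best = int(np.argmin(dist))
best_threshold = thresholds[best]  # operating point nearest the top left corner
```

For these made-up scores the chosen point is (fpr = 0, tpr = 0.75), reached at a threshold of 0.6.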

Why does the top left corner correspond to the ideal classifier? The ideal point on the ROC curve would be (0, 1): a 0% false positive rate and a 100% true positive rate. That is, all positive examples are classified correctly and no negative examples are misclassified as positive. In a perfect classifier, there would be no misclassification!

Whilst a graph may not seem particularly useful in itself, it is helpful in comparing classifiers. One particular metric, the Area Under Curve (AUC) score, allows us to compare classifiers by comparing the total area underneath the line produced on the ROC curve. For an ideal classifier, the AUC equals 1, since its curve passes through the top left corner (a 100% true positive rate at a 0% false positive rate) and therefore encloses the entire unit square. If a particular classifier has an AUC of 0.6 and another has an AUC of 0.8, the latter is clearly a better classifier. The AUC has the benefit that it is independent of the decision criterion (the classification threshold) and thus makes it easier to compare these classifiers.

A question may have come to mind now — what if some classifiers are better at lower thresholds and some are better at higher thresholds? This is where the ROC convex hull comes in. The convex hull provides us with a method of identifying potentially optimal classifiers — even though we may not have directly observed them, we can infer their existence. Consider the following diagram:

Source: Quora

Given a family of ROC curves, the ROC convex hull can include points that lie closer to the top left corner (perfect classifier) of the ROC space. If a line passes through a point on the convex hull, then there is no other line with the same slope passing through another point with a larger true-positive intercept. Thus, the classifier at that point is optimal under the class-distribution and cost assumptions corresponding to that slope. This is perhaps easier to understand after examining the image.
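The hull itself can be computed with a few lines of the standard monotone-chain algorithm. The (fpr, tpr) points below are invented observations from several classifier/threshold combinations; dominated points fall strictly below the hull:

```python
def roc_upper_hull(points):
    """Upper convex hull of ROC points (fpr, tpr), from (0, 0) to (1, 1).

    Points strictly below the returned hull cannot be optimal for any
    class distribution or cost ratio.
    """
    def cross(o, a, b):
        # Positive when the path o -> a -> b turns counter-clockwise
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Pop while the last turn is counter-clockwise, keeping only the upper hull
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


# Invented ROC points from several classifier/threshold combinations
observed = [(0.1, 0.5), (0.3, 0.6), (0.4, 0.5), (0.5, 0.9)]
hull = roc_upper_hull(observed)  # (0.3, 0.6) and (0.4, 0.5) are dominated
```

Here (0.3, 0.6) and (0.4, 0.5) drop out because they sit below the segment joining (0.1, 0.5) and (0.5, 0.9); no cost assumption would ever make them the best choice.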

How does undersampling/oversampling influence the ROC curve? A famous paper on SMOTE (discussed previously) titled “SMOTE: Synthetic Minority Over-sampling Technique” outlines that by undersampling the majority class, we force the ROC curve to move up and to the right, and thus has the potential to increase the AUC of a given classifier (this is essentially just validation that SMOTE functions correctly, as expected). Similarly, oversampling the minority class has a similar impact.

Source: Researchgate

Precision-Recall (PR) Curves

An analogous diagram to an ROC curve can be recast from ROC space into PR space. These diagrams are in many ways analogous to ROC curves, but instead of plotting recall against fallout (true positive rate vs. false positive rate), we instead plot precision against recall. This produces a somewhat mirrored picture of the ROC curve (the curve itself will look somewhat different), in the sense that the top right corner of a PR curve designates the ideal classifier. This can often be more understandable than an ROC curve but provides very similar information. The area under a PR curve is known as the average precision (AP); its mean over multiple classes is the mAP often reported in detection tasks, and it is analogous to the AUC in ROC space.
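The PR-space analogues of roc_curve and the AUC are available directly in scikit-learn; the summary number it computes is the average precision (AP). The labels and scores here are a tiny invented example:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# Tiny invented example: 2 negatives, 2 positives, with predicted scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

precision, recall, thresholds = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)  # step-wise area under the PR curve: 5/6
```

Plotting precision against recall (e.g. with plt.step(recall, precision)) reproduces the kind of curve shown in the figure above.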

Source: Researchgate — Ten quick tips for machine learning in computational biology

Final Comments

Imbalanced datasets are underrepresented (no pun intended) in many data science programs, despite their prevalence and importance in many industrial machine learning applications. It is the job of the data scientist to be able to recognize when a dataset is imbalanced, follow the appropriate procedures, and utilize metrics that allow this imbalance to be sufficiently understood and controlled.

I hope that in the course of reading this article you have learned something about dealing with imbalanced datasets and will in the future be comfortable in the face of such imbalanced problems. If you are a serious data scientist, it is only a matter of time before one of these applications pops up!

Newsletter

For updates on new blog posts and extra content, sign up for my newsletter.

Translated from: https://towardsdatascience.com/guide-to-classification-on-imbalanced-datasets-d6653aa5fa23
