How to Handle Multiclass Imbalanced Data? Say No to SMOTE
One of the common problems in machine learning is handling imbalanced data, in which the target classes are highly disproportionate.
Hello world, this is my second blog for the Data Science community. In this post, we are going to see how to deal with the multiclass imbalanced data problem.
What is Multiclass Imbalanced Data?
When the target classes (two or more) of a classification problem are not equally distributed, we call it imbalanced data. If we fail to handle this problem, the model can become a disaster, because modeling with class-imbalanced data is biased in favor of the majority class.
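As a quick sanity check (not part of the original post), you can inspect the class distribution of the target before modeling; a minimal sketch, assuming the target is a pandas Series called target_Y, the name used in the snippets below:

import pandas as pd

# Hypothetical, heavily skewed target: 90% class 0, 7% class 1, 3% class 2.
target_Y = pd.Series([0] * 900 + [1] * 70 + [2] * 30)

# Relative frequency per class; a strongly skewed output signals imbalanced data.
print(target_Y.value_counts(normalize=True))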
There are different methods of handling imbalanced data; the most common ones are oversampling and creating synthetic samples.
What is SMOTE?
SMOTE is an oversampling technique that generates synthetic samples from the dataset, which increases the predictive power for the minority classes. Even though there is no loss of information, it has a few limitations.
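For context, this is roughly how SMOTE is applied in practice; a minimal sketch using the imbalanced-learn package, which is my assumption here and not something the post itself shows:

from imblearn.over_sampling import SMOTE

# fit_resample returns a new dataset in which the minority classes are
# topped up with synthetic points interpolated between nearest neighbours.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, target_Y)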
(Figure: Synthetic Samples)
Limitations: SMOTE is not very effective for high-dimensional data, and the synthetic points it creates can overlap with neighbouring classes and introduce noise.
So, to avoid these problems, we can instead assign class weights manually with the class_weight parameter.
Why use Class Weights?
Class weights modify the loss function directly, penalizing mistakes on each class in proportion to its weight. In effect, this purposely increases the influence of the minority classes and reduces the influence of the majority class. Therefore, it gives better results than SMOTE.
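To make this concrete, here is a small sketch (my own illustration, not from the post) of how class weights scale a cross-entropy style loss so that errors on the minority classes cost more:

import numpy as np

def weighted_log_loss(y_true, y_prob, class_weights):
    # y_prob has shape (n_samples, n_classes); pick the probability of the true class.
    eps = 1e-15
    true_class_prob = np.clip(y_prob[np.arange(len(y_true)), y_true], eps, 1.0)
    # Scale each sample's loss by the weight of its true class.
    sample_w = np.array([class_weights[c] for c in y_true])
    return np.mean(sample_w * -np.log(true_class_prob))

# Illustrative weights: mistakes on the minority class 1 are penalized 5x more.
class_weights = {0: 1.0, 1: 5.0}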
Overview:
I aim to keep this blog very simple. Below are a few of the most preferred techniques for getting class weights, which worked well for my imbalanced learning problems.
1. Sklearn utils:
We can get class weights using sklearn's utility for computing class weights. Adding those weights for the minority classes while training the model can help the performance when classifying the classes.
import numpy as np
from sklearn.utils import class_weight
from sklearn.linear_model import LogisticRegression

# 'balanced' computes one weight per class, inversely proportional to its frequency.
weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(target_Y), y=target_Y)
class_weights = dict(zip(np.unique(target_Y), weights))

model = LogisticRegression(class_weight=class_weights)
model.fit(X, target_Y)
We have a class_weight parameter for almost all classification algorithms, from Logistic Regression to CatBoost. But XGBoost has scale_pos_weight for binary classification and sample_weight (see technique 4) for both binary and multiclass problems.
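For the binary case, a common rule of thumb (my assumption here, not spelled out in the post) is to set scale_pos_weight to the ratio of negative to positive samples, assuming a 0/1 target:

import numpy as np
from xgboost import XGBClassifier

# For a binary 0/1 target, bincount gives [negative_count, positive_count].
neg, pos = np.bincount(target_Y)
model = XGBClassifier(scale_pos_weight=neg / pos)
model.fit(X, target_Y)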
2. Counts to Length Ratio:
Very simple and straightforward! Divide the number of occurrences of each class by the number of rows, then use the result as the class weights.
from lightgbm import LGBMClassifier

# target_Y is the name of the target column in df here;
# each class weight = its count divided by the total number of rows.
weights = (df[target_Y].value_counts() / len(df)).to_dict()

model = LGBMClassifier(class_weight=weights)
model.fit(X, df[target_Y])
3. Smoothen Weights Technique:
This is one of the preferred methods of choosing weights.
labels_dict is a dictionary object containing the count of each class.
The log function smooths the weights of the imbalanced classes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def class_weight(labels_dict, mu=0.15):
    # labels_dict maps each class label to its sample count.
    total = np.sum(list(labels_dict.values()))
    keys = labels_dict.keys()
    weight = dict()
    for i in keys:
        score = np.log(mu * total / float(labels_dict[i]))
        weight[i] = score if score > 1 else 1
    return weight

# labels_dict holds the per-class counts; the values below are illustrative.
labels_dict = {0: 2813, 1: 78, 2: 2814, 3: 78}
weights = class_weight(labels_dict)

model = RandomForestClassifier(class_weight=weights)
model.fit(X, target_Y)
4. Sample Weight Strategy:
The function below is different from the class_weight parameter: it is used to get sample weights for the XGBoost algorithm, returning a separate weight for each training sample.
sample_weight is an array of the same length as the data, containing the weight to apply to the model's loss for each sample.
import numpy as np
from xgboost import XGBClassifier

def BalancedSampleWeights(y_train, class_weight_coef):
    classes = np.unique(y_train, axis=0)
    classes.sort()
    class_samples = np.bincount(y_train)
    total_samples = class_samples.sum()
    n_classes = len(class_samples)
    # Inverse-frequency weight for each class.
    weights = total_samples / (n_classes * class_samples * 1.0)
    class_weight_dict = {key: value for (key, value) in zip(classes, weights)}
    # Scale the weight of the chosen class by the coefficient.
    class_weight_dict[classes[1]] = class_weight_dict[classes[1]] * class_weight_coef
    # Map every training sample to the weight of its class.
    sample_weights = [class_weight_dict[i] for i in y_train]
    return sample_weights

# Usage: sample_weight is passed to fit, not to the XGBClassifier constructor.
weight = BalancedSampleWeights(y_train, class_weight_coef)
model = XGBClassifier()
model.fit(X, y_train, sample_weight=weight)
class_weights vs sample_weight:
sample_weight is used to give a weight to each training sample. That means you should pass a one-dimensional array with exactly the same number of elements as your training samples.
class_weights is used to give a weight to each target class. This means you should pass one weight for each class that you are trying to classify.
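A small side-by-side sketch of the two interfaces (illustrative weight values, using RandomForestClassifier, which accepts both forms):

from sklearn.ensemble import RandomForestClassifier

# class_weights: one weight per target class, passed when the model is built.
clf = RandomForestClassifier(class_weight={0: 1.0, 1: 4.0, 2: 2.0})
clf.fit(X, target_Y)

# sample_weight: one weight per training row, passed to fit.
sample_weights = [4.0 if label == 1 else 1.0 for label in target_Y]
clf = RandomForestClassifier()
clf.fit(X, target_Y, sample_weight=sample_weights)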
Conclusion:
The above are a few methods of finding class weights and sample weights for your classifier. I have mentioned almost all the techniques that worked well for my project.
I request readers to give these techniques a try, since they could help you; if not, take it as learning 😄 it may help you another time 😜
Reach me on LinkedIn 😍
Translated from: https://towardsdatascience.com/how-to-handle-multiclass-imbalanced-data-say-no-to-smote-e9a7f393c310