

fairmodels: Let's fight with biased machine learning models


TL;DR

The R package fairmodels facilitates bias detection through model visualizations. It implements a few mitigation strategies that can reduce bias, and it enables easy-to-use checks of fairness metrics and comparisons between different machine learning (ML) models.


Long version

Bias mitigation is an important topic in the machine learning (ML) fairness field. For Python users, there are algorithms that are already implemented, well explained, and described (see AIF360). fairmodels provides an implementation of a few popular, effective bias mitigation techniques, ready to make your model fairer.


I have a biased model, now what?

Having a biased model is not the end of the world; there are many ways to deal with it. fairmodels implements several algorithms to help you tackle the problem. First, I must describe the difference between pre-processing and post-processing algorithms.


  • Pre-processing algorithms work on the data before the model is trained. They try to mitigate the bias between the privileged subgroup and the unprivileged ones through inference from the data.


  • Post-processing algorithms change the output of a model explained with DALEX so that it does not favor the privileged subgroup as much.


How do these algorithms work?

In this section, I will briefly describe how these bias mitigation techniques work. Code for more detailed examples and some visualizations used here may be found in this vignette.


Pre-processing

Disparate impact remover (Feldman et al., 2015)

(image by author) Disparate impact removal: the blue and red distributions are transformed into a "middle" distribution.

This algorithm works on numeric, ordinal features. It changes the column values so that the distributions for the unprivileged (blue) and privileged (red) subgroups are close to each other. In general, we would like the algorithm to judge not on the raw value of the feature but rather on its percentile (e.g., hiring the top 20% of applicants for the job from each subgroup). The algorithm finds the distribution that minimizes the earth mover's distance; in simple words, it finds the "middle" distribution and maps each subgroup's values in this feature onto it.

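A minimal sketch of how this could be applied with fairmodels, assuming the package's adult dataset and its numeric hours_per_week column (the column choice here is purely illustrative):

library(fairmodels)
data("adult")

# transform a numeric feature so that its distribution becomes similar for both sexes;
# lambda = 1 moves both subgroups fully onto the "middle" distribution, lambda = 0 leaves the data unchanged
adult_fixed <- disparate_impact_remover(data = adult,
                                        protected = adult$sex,
                                        features_to_transform = "hours_per_week",
                                        lambda = 1)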

Reweighting (Kamiran et al., 2012)

(image by author) In this mock-up example, S=1 is the privileged subgroup. There is a weight for each unique combination of S and y.

Reweighting is a simple but effective tool for minimizing bias. The algorithm looks at the protected attribute and at the true label. Then it calculates the probability of assigning the favorable label (y=1) under the assumption that the protected attribute and y are independent. Of course, if there is bias, they will be statistically dependent. The algorithm then divides this theoretical probability by the true, empirical probability of the event; that ratio is the weight. With these two vectors (protected variable and y) we can create a weight for each observation in the data and pass the weights to the model. Simple as that. But some models don't have a weights parameter and therefore can't benefit from this method.

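A minimal sketch, assuming a gbm model on the adult data; reweight returns a plain numeric vector, so any model that accepts a weights argument could use it:

library(gbm)
library(fairmodels)
data("adult")
adult$salary <- as.numeric(adult$salary) - 1

# one weight per observation, computed from the protected attribute and the label
weights <- reweight(protected = adult$sex, y = adult$salary)

gbm_weighted <- gbm(salary ~ . - sex, data = adult,
                    weights = weights,
                    distribution = "bernoulli")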

Resampling (Kamiran et al., 2012)

(image by author) Uniform resampling. Circles denote duplicated observations; x's denote omitted ones.

Resampling is closely related to the previous method, as it implicitly uses reweighting to calculate how many observations must be omitted or duplicated in each case. Imagine there are two groups: deprived (S=0) and favored (S=1). The method duplicates observations from the deprived subgroup when the label is positive and omits observations with a negative label; the opposite is then done for the favored group. Two types of resampling are implemented: uniform and preferential. Uniform picks observations at random (as in the picture), whereas preferential uses the model's probabilities to pick/omit observations close to the cutoff (default 0.5).

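A minimal sketch: resample returns row indices that define the new training set (the preferential variant would additionally need predicted probabilities passed through the probs argument):

library(fairmodels)
data("adult")
adult$salary <- as.numeric(adult$salary) - 1

# indices of the observations to keep/duplicate under uniform resampling
idx <- resample(protected = adult$sex, y = adult$salary, type = "uniform")
adult_resampled <- adult[idx, ]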

Post-processing

Post-processing takes place after creating an explainer. To create an explainer we need the model and DALEX. A gbm model will be trained on the adult dataset to predict whether a person earns more than 50k annually.


library(gbm)
library(DALEX)
library(fairmodels)

data("adult")
adult$salary <- as.numeric(adult$salary) - 1
protected <- adult$sex
adult <- adult[colnames(adult) != "sex"] # sex not specified

# making model
set.seed(1)
gbm_model <- gbm(salary ~ ., data = adult, distribution = "bernoulli")

# making explainer
gbm_explainer <- explain(gbm_model,
                         data = adult[, -1],
                         y = adult$salary,
                         colorize = FALSE)
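
With the explainer in place, a baseline fairness check can be run before any post-processing; a minimal sketch, taking "Male" as the privileged level of the protected variable (as later in this post):

fobject_base <- fairness_check(gbm_explainer,
                               protected = protected,
                               privileged = "Male",
                               label = "gbm_base")
plot(fobject_base)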

Reject Option based Classification (pivot) (Kamiran et al., 2012)

(image by author) Red: privileged, blue: unprivileged. If a probability falls within (cutoff - theta, cutoff + theta), it is moved (pivoted) to the opposite side of the cutoff.

ROC pivot is implemented based on Reject Option based Classification. The algorithm switches labels if an observation is from the unprivileged group and lies to the left of the cutoff; the opposite is then done for the privileged group. There is an assumption that the observation must be close (in terms of probability) to the cutoff, so the user must provide a value theta that tells the algorithm how close an observation must be to the cutoff to be switched. But there is a catch. If only the labels were changed, the DALEX explainer would have a hard time properly calculating the performance of the model. For that reason, in the fairmodels implementation of this algorithm it is the probabilities that are switched (pivoted): they are moved to the other side of the cutoff, at an equal distance from it.

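A minimal sketch using the explainer built above; theta = 0.05 is an arbitrary illustrative value:

gbm_explainer_pivoted <- roc_pivot(gbm_explainer,
                                   protected = protected,
                                   privileged = "Male",
                                   cutoff = 0.5,
                                   theta = 0.05)

# the result is still a DALEX explainer, only with pivoted probabilities,
# so it can go straight into fairness_check
fairness_check(gbm_explainer_pivoted,
               protected = protected,
               privileged = "Male",
               label = "gbm_roc_pivot")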

Cutoff manipulation

(image by author) plot(ceteris_paribus_cutoff(fobject, cumulated = TRUE))

Cutoff manipulation can be a great way to minimize bias in a model. We simply choose the metrics and the subgroup for which the cutoff will change. The plot shows where the minimum is, i.e., for which cutoff value the parity loss will be lowest. How do we create a fairness_object with a different cutoff for a certain subgroup? It is easy!

截?cái)嗖僮鲗τ谧钚』P椭械钠羁赡苁且粋€(gè)好主意。 我們僅選擇截止值將更改的指標(biāo)和子組。 該圖顯示了最小值所在的位置,并且對于該閾值, 奇偶校驗(yàn)損耗將是最低的。 如何為某些子組創(chuàng)建具有不同截止值的fairness_object? 這很容易!

fobject <- fairness_check(gbm_explainer,
                          protected = protected,
                          privileged = "Male",
                          label = "gbm_cutoff",
                          cutoff = list(Female = 0.35))

Now the fairness_object (fobject) is a structure with the specified cutoff, and it will affect both the fairness metrics and the performance.

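To see where a value like 0.35 comes from, one can plot the ceteris-paribus cutoff curve for the affected subgroup and then inspect the resulting object; a minimal sketch (the subgroup argument and cumulated = TRUE mirror the caption above):

# parity loss as a function of the cutoff for the "Female" subgroup
plot(ceteris_paribus_cutoff(fobject, subgroup = "Female", cumulated = TRUE))

# fairness metrics with the custom cutoff applied
plot(fobject)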

The tradeoff between fairness and accuracy

If we want to mitigate bias, we must be aware of the possible drawbacks. Let's say that statistical parity is the most important metric for us. Lowering the parity loss of this metric will (probably) result in an increase of false positives, which will cause the accuracy to drop. For this example (which you can find here), a gbm model was trained and then treated with different bias mitigation techniques.


(image by author)

The more we try to mitigate the bias, the less accuracy we get. This is natural for this metric, and the user should be aware of it.

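fairmodels can also visualize this tradeoff directly from a fairness object; a minimal sketch, assuming statistical parity ("STP") as the fairness metric and accuracy as the performance metric:

paf <- performance_and_fairness(fobject,
                                fairness_metric = "STP",
                                performance_metric = "accuracy")
plot(paf)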

Summary

The debiasing methods implemented in fairmodels are certainly worth trying. They are flexible, most of them are suited to any model, and above all they are easy to use.


What to read next?

  • Blog post about introduction to fairness, problems, and solutions


  • Blog post about fairness visualization


Learn more

  • Check the package’s GitHub website for more details


  • Tutorial on full capabilities of the fairmodels package


  • Tutorial on bias mitigation techniques


Translated from: https://towardsdatascience.com/fairmodels-lets-fight-with-biased-machine-learning-models-f7d66a2287fc
