MissForest: The Best Missing Data Imputation Algorithm
Missing data often plagues real-world datasets, and hence there is tremendous value in imputing, or filling in, the missing values. Unfortunately, standard ‘lazy’ imputation methods like simply using the column median or average don’t work well.
On the other hand, KNN is a machine-learning-based imputation algorithm that has seen success, but it requires tuning the parameter k and inherits many of KNN's weaknesses, like sensitivity to outliers and noise. Additionally, depending on the circumstances, it can be computationally expensive: the entire dataset must be stored, and distances computed between every pair of points.
MissForest is another machine-learning-based data imputation algorithm, built on the Random Forest algorithm. Stekhoven and Buhlmann, the algorithm's creators, conducted a study in 2011 comparing imputation methods on datasets with randomly introduced missing values. MissForest outperformed all the other algorithms, including KNN-Impute, on every metric, in some cases by over 50%.
First, the missing values are filled in using median/mode imputation. Then, the rows with missing values are marked 'Predict' and the remaining rows are used as training data for a Random Forest model trained to predict, in this example, Age from Score. The model's prediction for each 'Predict' row is then filled in, producing a transformed dataset.
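To make the mechanics concrete, here is a minimal sketch of a single imputation pass using scikit-learn's RandomForestRegressor on a toy DataFrame with hypothetical Score and Age columns. It illustrates the idea rather than reproducing MissForest's exact internals:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data: Age has two missing entries we want to impute from Score.
df = pd.DataFrame({
    "Score": [90, 80, 70, 60, 50],
    "Age":   [20, np.nan, 30, np.nan, 40],
})

missing = df["Age"].isna()                        # rows marked 'Predict'
df["Age"] = df["Age"].fillna(df["Age"].median())  # step 1: median fill

# Step 2: train on the originally observed rows, then overwrite the
# placeholder values with the forest's predictions.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(df.loc[~missing, ["Score"]], df.loc[~missing, "Age"])
df.loc[missing, "Age"] = rf.predict(df.loc[missing, ["Score"]])
print(df)
```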
(Figure: assume that the dataset shown is truncated. Image created by author.)

This process of looping through missing data points repeats several times, each iteration training on better and better data. It's like standing on a pile of rocks while continually adding more to raise yourself: the model uses its current position to elevate itself further.
In subsequent iterations, the model may adjust its predictions or keep them the same.
Iterations continue until some stopping criterion is met or a set number of iterations has elapsed. As a general rule, datasets become well imputed after four to five iterations, though this depends on the size of the dataset and the amount of missing data.
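For the curious, the stopping rule in Stekhoven and Buhlmann's paper halts the first time the difference between successive imputations increases. A single-column sketch of the loop, under the simplifying assumptions that the target is numeric and the predictor columns are fully observed (impute_column is a hypothetical helper, not part of any library):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_column(df, target, predictors, max_iter=10):
    """Iteratively impute one numeric column, MissForest-style."""
    df = df.copy()
    missing = df[target].isna()
    df[target] = df[target].fillna(df[target].median())  # initial fill
    prev_diff = np.inf
    for _ in range(max_iter):
        old = df.loc[missing, target].to_numpy()
        rf = RandomForestRegressor(n_estimators=100, random_state=0)
        rf.fit(df.loc[~missing, predictors], df.loc[~missing, target])
        new = rf.predict(df.loc[missing, predictors])
        # Normalized change between successive imputations; stop the first
        # time it increases, keeping the previous (better) imputation.
        diff = np.sum((new - old) ** 2) / np.sum(new ** 2)
        if diff > prev_diff:
            break
        df.loc[missing, target] = new
        prev_diff = diff
    return df
```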
There are many benefits to using MissForest. For one, it can be applied to mixed data types, numerical and categorical. Using KNN-Impute on categorical data requires first converting it into some numerical measure. This scale (usually 0/1 with dummy variables) is almost always incompatible with the scales of the other dimensions, so the data must be standardized.
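A toy illustration of that scale problem, with hypothetical income and city columns: after dummy encoding, the categorical columns live on a 0/1 scale that Euclidean distance would all but ignore next to income, so standardization becomes mandatory before KNN-Impute can be used:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"income": [30000, 85000, 52000],
                   "city": ["NY", "SF", "NY"]})

# city_NY/city_SF end up on a 0/1 scale, dwarfed by income in raw
# Euclidean distance; scaling puts all columns on comparable footing.
encoded = pd.get_dummies(df, columns=["city"])
scaled = StandardScaler().fit_transform(encoded)
```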
In a similar vein, no pre-processing is required. Since KNN uses naïve Euclidean distances, all sorts of steps, like categorical encoding, standardization, normalization, scaling, and data splitting, need to be taken to ensure its success. Random Forest, on the other hand, can handle the data as-is, because it doesn't make assumptions about feature relationships the way K-Nearest Neighbors does.
MissForest is also robust to noisy data and multicollinearity, since random forests have built-in feature selection (evaluating entropy and information gain). KNN-Impute, by contrast, yields poor predictions when a dataset has weak predictors or heavy correlation between features.
KNN's results are also heavily determined by the value of k, which must essentially be discovered through a try-everything search. Random Forest, on the other hand, requires no such tuning. It can also handle high-dimensional data, and is not prone to the Curse of Dimensionality to the extent that KNN-Impute is.
On the other hand, MissForest does have some downsides. For one, even though it takes up less space than KNN, running it on a sufficiently small dataset may be more expensive. Additionally, it's an algorithm, not a model object: it must be re-run every time data is imputed, which may not work in some production environments.
Using MissForest is simple. In Python, it is available through the missingpy library, which has a sklearn-like interface and many of the same parameters as RandomForestClassifier/RandomForestRegressor. The complete documentation can be found on GitHub.
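A sketch of typical usage follows. The MissForest class and the cat_vars argument come from missingpy's documented interface; the toy matrix and parameter values are purely illustrative:

```python
import numpy as np
from missingpy import MissForest

# Toy matrix: two numeric columns and one categorical column (index 2),
# with np.nan marking the missing entries.
X = np.array([
    [90.0, 20.0, 0.0],
    [80.0, np.nan, 1.0],
    [70.0, 30.0, np.nan],
    [60.0, np.nan, 0.0],
    [50.0, 40.0, 1.0],
])

imputer = MissForest(max_iter=10, n_estimators=100, random_state=0)
X_imputed = imputer.fit_transform(X, cat_vars=[2])  # column 2 is categorical
```

Passing cat_vars tells the imputer to treat those columns as categorical, so they are imputed by classification rather than regression.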
The model is only as good as the data, so taking proper care of the dataset is a must. Consider using MissForest next time you need to impute missing data!
Thanks for reading!
翻譯自: https://towardsdatascience.com/missforest-the-best-missing-data-imputation-algorithm-4d01182aed3