Handling Missing Values in R

Published: 2023/12/14

Missing values

1. is.na(): locating missing values

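Note: missing values are considered non-comparable, even to themselves. This means you cannot use comparison operators to test for the presence of missing values. For example, the logical test myvar == NA never evaluates to TRUE; it evaluates to NA. Instead, you must use functions designed to handle missing values (such as those described in this section) to identify the missing values in an R data object.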
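The following minimal sketch illustrates the note above using only base R. The vector myvar and its values are made up for illustration.

```r
# Illustrative data: a vector with one missing value
myvar <- c(1, NA, 3)

# Comparing against NA yields NA for every element, never TRUE
cmp <- myvar == NA

# is.na() returns a logical mask marking the missing positions
na_mask <- is.na(myvar)

# which() turns the mask into the positions of the missing values
na_pos <- which(na_mask)

print(cmp)
print(na_mask)
print(na_pos)
```

The same functions work on data frames: is.na(df) returns a logical matrix with one entry per cell.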

2. na.omit(): dropping incomplete observations
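As a small illustration of what na.omit() does, the sketch below applies it to a made-up data frame (the data frame df is an assumption for the example, not taken from the original article). Rows containing at least one NA are dropped, and the indices of the dropped rows are recorded in the "na.action" attribute of the result.

```r
# Illustrative data frame with missing values in rows 2 and 3
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c("a", NA, "c", "d"))

# Keep only the rows without any NA
clean <- na.omit(df)

# How many rows were removed, and which ones
n_dropped <- nrow(df) - nrow(clean)
dropped_rows <- as.integer(attr(clean, "na.action"))

print(clean)
print(dropped_rows)
```
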

manyNAs

manyNAs(data, nORp = 0.2)
Arguments

data
A data frame with the data set.

nORp
A number controlling when a row is considered to have too many NA values (defaults to 0.2, i.e. 20% of the columns). If no rows satisfy the constraint indicated by the user, a warning is generated.
Rows are flagged according to their proportion of missing values.
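The idea of flagging rows by their share of missing values can be sketched in base R as follows. This is not the DMwR implementation: rows_with_many_nas is a hypothetical helper written for illustration, and it assumes that "too many" means strictly more than the given proportion of columns.

```r
# Hypothetical helper (NOT DMwR::manyNAs): return the indices of rows
# whose proportion of NA cells is strictly greater than `prop`
rows_with_many_nas <- function(data, prop = 0.2) {
  unname(which(rowMeans(is.na(data)) > prop))
}

# Illustrative data: row 2 has 2 of its 4 columns missing
df <- data.frame(a = c(1, NA, 3),
                 b = c(4, NA, 6),
                 c = c(7, 8, 9),
                 d = c(10, 11, 12))

flagged <- rows_with_many_nas(df)

# Drop the flagged rows, guarding against the empty case
# (df[-integer(0), ] would drop every row)
kept <- if (length(flagged) > 0) df[-flagged, ] else df

print(flagged)
print(kept)
```
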

3. knnImputation(): k-nearest-neighbour imputation

library(DMwR)
knnImputation(data, k = 10, scale = T, meth = "weighAvg", distData = NULL)

Arguments

data
A data frame with the data set.

k
The number of nearest neighbours to use (defaults to 10).

scale
Boolean setting whether the data should be scaled before finding the nearest neighbours (defaults to T).

meth
String indicating the method used to calculate the value to fill in each NA. Available values are 'median' or 'weighAvg' (the default).

distData
Optionally, a data frame containing the data set that should be used to find the neighbours. This is useful when filling in NA values on a test set, where you should use only information from the training set. This defaults to NULL, which means that the neighbours will be searched for in data.

Details
This function uses the k-nearest neighbours to fill in the unknown (NA) values in a data set. For each case with any NA value it will search for its k most similar cases and use the values of these cases to fill in the unknowns.

If meth='median', the function uses either the median (for numeric variables) or the most frequent value (for factors) of the neighbours to fill in the NAs. If meth='weighAvg', the function uses a weighted average of the values of the neighbours. The weights are given by exp(-dist(k, x)), where dist(k, x) is the Euclidean distance between the case with NAs (x) and the neighbour k.
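To make the weighting concrete, here is a small base R sketch of how a single missing value could be filled from its neighbours under the description above. The distances and neighbour values are made-up numbers, and the normalisation by the sum of the weights is the usual definition of a weighted average, assumed here rather than taken from the DMwR source.

```r
# Made-up example: Euclidean distances from the incomplete case x
# to its 3 nearest neighbours, and the neighbours' values for the
# variable that is missing in x
d    <- c(0.5, 1.0, 2.0)
vals <- c(10, 20, 40)

# Weights decay exponentially with distance: closer neighbours count more
w <- exp(-d)

# meth = "weighAvg": weighted average of the neighbours' values
imputed_wavg <- sum(w * vals) / sum(w)

# meth = "median" (numeric variable): plain median of the neighbours' values
imputed_median <- median(vals)

print(imputed_wavg)
print(imputed_median)
```

Because the nearest neighbour carries the largest weight, the weighted average is pulled towards its value, whereas the median ignores the distances entirely.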

Example:

# Load the package and clean the data first
library(DMwR)
data(algae)
algae <- algae[-manyNAs(algae), ]
clean.algae <- knnImputation(algae[, 1:12], k = 10)

> head(clean.algae)
  season  size  speed mxPH mnO2     Cl    NO3     NH4    oPO4     PO4 Chla   a1
1 winter small medium 8.00  9.8 60.800  6.238 578.000 105.000 170.000 50.0  0.0
2 spring small medium 8.35  8.0 57.750  1.288 370.000 428.750 558.750  1.3  1.4
3 autumn small medium 8.10 11.4 40.020  5.330 346.667 125.667 187.057 15.6  3.3
4 spring small medium 8.07  4.8 77.364  2.302  98.182  61.182 138.700  1.4  3.1
5 autumn small medium 8.06  9.0 55.350 10.416 233.700  58.222  97.580 10.5  9.2
6 winter small   high 8.25 13.1 65.750  9.248 430.000  18.250  56.667 28.4 15.1

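4. centralImputation(): imputation by central value

Fills in the missing values of each column with the median of that column's non-missing observations (for factor columns, the most frequent value is used).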

library(DMwR)
data(algae)
cleanAlgae <- centralImputation(algae)
summary(cleanAlgae)
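For readers without DMwR installed, the median-filling idea can be sketched in base R. The function fill_with_median below is a hypothetical helper written for this illustration, not the DMwR implementation, and it only handles numeric columns (non-numeric columns are left untouched).

```r
# Hypothetical helper (NOT DMwR::centralImputation): replace the NAs of
# every numeric column with the median of that column's observed values
fill_with_median <- function(data) {
  for (col in names(data)) {
    if (is.numeric(data[[col]])) {
      missing <- is.na(data[[col]])
      data[[col]][missing] <- median(data[[col]], na.rm = TRUE)
    }
  }
  data
}

# Illustrative data: one NA in the numeric column, one in the character column
df <- data.frame(x = c(1, NA, 3, 10),
                 g = c("a", "b", NA, "a"))

filled <- fill_with_median(df)
print(filled)
```

Filling with a single central value is simple but shrinks the variance of the column, so it is best suited to columns with only a few missing values.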

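5. complete.cases(): finding complete cases

x <- airquality[, -1]  # x is a regression design matrix
y <- airquality[, 1]   # y is the corresponding response
# Check that complete.cases(y) is the element-wise negation of is.na(y)
stopifnot(complete.cases(y) != is.na(y))
# Logical vector: TRUE for rows with no missing value in either x or y
ok <- complete.cases(x, y)
# Number of rows with at least one missing value
sum(!ok)  # how many are not "ok"?
# Keep only the complete rows
x <- x[ok, ]
y <- y[ok]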

6. na.fail(): failing when missing values are present

DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
na.fail(DF)
Error in na.fail.default(DF) : missing values in object
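Since na.fail() returns its argument unchanged when nothing is missing and raises an error otherwise, it can serve as a guard in a pipeline. The sketch below shows one possible way to use it with tryCatch(); falling back to na.omit() in the error handler is a choice made for this example, not something the original article prescribes.

```r
# No missing values: na.fail() returns the object unchanged
complete_df <- data.frame(x = c(1, 2, 3), y = c(0, 10, 20))
checked <- na.fail(complete_df)

# Missing values: na.fail() signals an error, which we catch here
DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
result <- tryCatch(
  na.fail(DF),
  error = function(e) {
    message("na.fail() stopped: ", conditionMessage(e))
    na.omit(DF)  # illustrative fallback: drop the incomplete rows
  }
)

print(checked)
print(result)
```
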
