Data mining - Elegantly removing observations with outliers in N fields - 吾爱随笔录

I have a function.

remove_outliers <- function(x, na.rm = TRUE, ...) {

    #find the 1st and 3rd quartiles, ignoring NA's when na.rm = TRUE
    qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)

    H <- 1.5 * IQR(x, na.rm = na.rm)

    y <- x
    y[x < (qnt[1] - H)] <- NA
    y[x > (qnt[2] + H)] <- NA
    x<-y

    #get rid of any NA's
    x[!is.na(x)]
}

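Given a numeric vector (numbers), calling the function is straightforward:

remove_outliers(numbers)

which leaves me with the same vector minus its outliers.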
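As a quick check of what the function computes, here is a minimal sketch that works through the fences by hand. It assumes the input is the six values from the `numbers` column of the table shown further down (5, 9, 2, 99, 3, 4) and that R's default quantile type is used; the variable names `lower`, `upper`, and `kept` are only for illustration.

```r
# Values assumed to match the `numbers` column used later in this post
numbers <- c(5, 9, 2, 99, 3, 4)

# 1st and 3rd quartiles (default quantile type) and the 1.5 * IQR margin
qnt <- quantile(numbers, probs = c(.25, .75), names = FALSE)
H <- 1.5 * IQR(numbers)

# Fences: anything outside [lower, upper] counts as an outlier
lower <- qnt[1] - H
upper <- qnt[2] + H

# Keep only the values inside the fences
kept <- numbers[numbers >= lower & numbers <= upper]

print(c(lower = lower, upper = upper))
print(kept)
```

With these values the quartiles are 3.25 and 8, so the fences sit at -3.875 and 15.125 and only 99 falls outside them.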

But what if I have an ID that I want to keep, for example:

number_id    numbers
12              5
23              9
34              2
45              99
56              3
67              4

How can I use the remove_outliers function (or another, better-suited function) to remove the outlier (99) and get this data:

number_id    numbers
12              5
23              9
34              2
56              3
67              4

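(Note that the entire observation containing the outlier has been removed.)

And how can this solution be extended to handle more variables?

I can do this very inelegantly by pulling out each column separately and building a new data frame in a loop, but that is hard to read and messy to debug. Is there a more elegant way?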

id <- c(12, 23, 34, 45, 56, 67)
num <- c(5, 9, 2, 99, 3, 4)
prac <- data.frame(id, num)

remove_outliers <- function(x, col) {

    #find the 1st and 3rd quartiles, ignoring NA's
    qnt <- quantile(x[ ,col], probs=c(.25, .75), na.rm = TRUE)

    H <- 1.5 * IQR(x[ ,col], na.rm = TRUE)

    #mark values outside the fences as NA
    x[ ,col] <- ifelse(x[ ,col] < (qnt[1] - H) | x[ ,col] > (qnt[2] + H), NA, x[ ,col])

    #get rid of any NA's (drops the whole row)
    x <- x[!is.na(x[ ,col]), ]

    #side effect: also store the result as `dataset` in the global environment
    assign("dataset", x, envir = .GlobalEnv)

    return(x)
}

remove_outliers(prac, 2)
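The function above handles one column at a time. One possible way to extend it to several variables is sketched below. This is an illustration, not part of the original answer: the helper names `within_fences` and `remove_outlier_rows` and the second column `num2` are all made up, and treating NA as "not an outlier" is an assumption you may want to change.

```r
# Hypothetical helper: TRUE where a value lies inside the k * IQR fences.
# Assumption: NA values are not treated as outliers.
within_fences <- function(v, k = 1.5) {
  qnt <- quantile(v, probs = c(.25, .75), na.rm = TRUE, names = FALSE)
  H <- k * (qnt[2] - qnt[1])
  is.na(v) | (v >= qnt[1] - H & v <= qnt[2] + H)
}

# Hypothetical wrapper: drop every row that has an outlier in any of `cols`.
remove_outlier_rows <- function(df, cols) {
  ok <- vapply(df[cols], within_fences, logical(nrow(df)))
  ok <- matrix(ok, nrow = nrow(df))      # keep matrix shape for 1-row input
  df[rowSums(!ok) == 0, , drop = FALSE]  # keep rows with no flagged column
}

prac <- data.frame(
  id   = c(12, 23, 34, 45, 56, 67),
  num  = c(5, 9, 2, 99, 3, 4),
  num2 = c(1, 2, 3, 4, 5, 6)  # made-up second variable for illustration
)

cleaned <- remove_outlier_rows(prac, c("num", "num2"))
print(cleaned)
```

The fences for each column are computed once on the full data and the rows are filtered in a single step. Calling a one-column function repeatedly would instead recompute later fences on already-filtered data, which can give a different result. The sketch also returns the cleaned data frame rather than writing to the global environment.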