How can I robustly identify a floor trendline while ignoring outliers?

linear-model · algorithms · outliers · trend
2022-03-20 17:22:24

I have the following dataset (the x and y scales are the same on every chart):

[image: the example scatterplots]

On each chart the eye picks out an upward-sloping floor trendline, running up from left to right and hugging the bottom of the main mass of blue points, with a few outliers below it. The trendline differs from chart to chart.

This is easy for the eye to discern. But is there a robust algorithm for doing it?

I have considered writing an objective function that computes each point's residual from a candidate trendline, discards (say) the 1% of points furthest below the line, and then weights the remaining residuals so that points above the line contribute little to the objective and points below it contribute a lot, making the line hug the bottom of the main mass of points. Then find the trendline that minimizes that objective.
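As a rough sketch of that objective (nothing here is from the question itself: the synthetic data, the 99:1 weight ratio, and the use of R's `optim` are all my own assumptions), the idea can be written down directly:

```r
# minimal sketch of the proposed objective: drop the lowest 1% of
# residuals, then weight below-line residuals 99x more than above-line ones
set.seed( 1 )
x <- runif( 1000 , 0 , 10 )
y <- 1 + 0.5 * x + rexp( 1000 )   # mass of points above a floor of y = 1 + 0.5x
y[ 1:10 ] <- y[ 1:10 ] - 5        # a few outliers below the floor

floor.loss <- function( par , x , y , below.weight = 99 , drop = 0.01 ){
    r <- y - ( par[ 1 ] + par[ 2 ] * x )                         # signed residuals
    r <- sort( r )[ -seq_len( ceiling( drop * length( r ) ) ) ]  # discard lowest 1%
    sum( ifelse( r >= 0 , r , -below.weight * r ) )              # asymmetric penalty
}

fit <- optim( c( 0 , 0 ) , floor.loss , x = x , y = y )
fit$par   # estimated intercept and slope of the floor line
```

Note that without the discard step this asymmetric loss is exactly the check loss of quantile regression: a below:above weight ratio of w:1 targets the 1/(1+w) conditional quantile.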

Is there a more robust approach?

3 Answers

I think linear quantile regression is close to what you want. It fits a line so that, for each x value, the predicted value is close to a chosen quantile of the response conditional on x.

Here is an R package: http://cran.r-project.org/web/packages/quantreg/index.html

For example, you could try the 1% quantile and see whether it avoids the outliers. You can adjust the chosen quantile until it looks right.
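A minimal sketch of that fit (assuming the `quantreg` package is installed; the simulated data are illustrative, not the questioner's):

```r
library( quantreg )

# simulate points piling up above a floor of y = 1 + 0.5x
set.seed( 1 )
x <- runif( 1000 , 0 , 10 )
y <- 1 + 0.5 * x + rexp( 1000 )
y[ 1:5 ] <- y[ 1:5 ] - 5        # a few outliers below the floor

# fit the 1% conditional quantile -- the "floor" line
fit <- rq( y ~ x , tau = 0.01 )
coef( fit )

plot( x , y )
abline( fit , col = "red" )
```

Raising `tau` pulls the line up into the cloud; lowering it pushes the line down toward (and eventually through) the outliers.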

If you want a more principled way to decide where the outliers begin, I think you need to make more assumptions about the distribution of the data.

Is there only one trendline? Probably not. You have outlying points below the "visual floor trendline". How are these to be "ignored" so that the main floor trend the eye sees is captured without being influenced by them? To detect the trend while damping their influence, we want to detect the trendline and the pulses that are inconsistent with it simultaneously.

If you can reduce the x-y observations as you described and then post the reduced set, I could try the only commercial software I know of that handles trend detection while accounting for ARIMA structure and pulses. As far as I know nothing free is available, and of course heuristics that replicate the human eye are not publicly disclosed.

Perhaps try running this custom code in R. If you have never used R, check out http://twotorials.com/ to get started ;)

# create eight sets of a thousand random x values
x1 <- rnorm( 1000 , mean = 1 )
x2 <- rnorm( 1000 , mean = 2 )
x3 <- rnorm( 1000 , mean = 3 )
x4 <- rnorm( 1000 , mean = 4 )
x5 <- rnorm( 1000 , mean = 5 )
x6 <- rnorm( 1000 , mean = 1 )
x7 <- rnorm( 1000 , mean = 1 )
x8 <- rnorm( 1000 , mean = 3 )

# create eight sets of a thousand random y values
y1 <- rnorm( 1000 , mean = 1 )
y2 <- rnorm( 1000 , mean = 2 )
y3 <- rnorm( 1000 , mean = 3 )
y4 <- rnorm( 1000 , mean = 4 )
y5 <- rnorm( 1000 , mean = 5 )
y6 <- rnorm( 1000 , mean = 5 )
y7 <- rnorm( 1000 , mean = 3 )
y8 <- rnorm( 1000 , mean = 5 )

# combine all of these values into two vectors
x <- c( x1 , x2 , x3 , x4 , x5 , x6 , x7 , x8 )
y <- c( y1 , y2 , y3, y4 , y5 , y6 , y7 , y8 )

# this distribution looks like your example distribution
plot( x , y )

# along the x axis, figure out some reasonable intervals to "bin" the data

# let's say you want one bin per 100 points..
num.bins <- length( x ) / 100

# figure out what quantiles to cut your bins at
quantile.probs <- seq( 0 , 1 , length.out = num.bins )

# slice up your `x` data into that many equal bins
bin.cutpoints <- quantile( x , quantile.probs )

# now let's look at just the first bin #

# positions within the first bin
first.bin <- which( bin.cutpoints [ 1 ] <= x & x < bin.cutpoints [ 2 ] )

# x midpoint between first two cutpoints
first.midpoint <- as.numeric( bin.cutpoints [ 1 ] + ( bin.cutpoints [ 2 ] - bin.cutpoints [ 1 ] ) / 2 )

first.midpoint

# since you wanted to discard 1% of all points, choose the 1% quantile cutoff point
one.percent.cutoff <- round( quantile( 1:length( y[ first.bin ] ) , 0.01 ) )

# find the point at the edge of the first percentile within this bin
first.percentile <- sort( y[ first.bin ] )[ one.percent.cutoff ]

# and there's your `y` value
first.percentile


# create two empty vectors to start storing values
low.x <- NULL
low.y <- NULL

# repeat this process for all bins:
for ( i in 2:length( bin.cutpoints ) ){

    this.bin <- which( bin.cutpoints [ i - 1 ] <= x & x < bin.cutpoints [ i ] )

    this.midpoint <- as.numeric( bin.cutpoints [ i - 1 ] + ( bin.cutpoints [ i ] - bin.cutpoints [ i - 1 ] ) / 2 )

    low.x <- c( low.x , this.midpoint )

    # since you wanted to discard 1% of all points, choose the 1% quantile cutoff point
    one.percent.cutoff <- round( quantile( 1:length( y[ this.bin ] ) , 0.01 ) )

    # find the point at the edge of the first percentile within this bin
    first.percentile <- sort( y[ this.bin ] )[ one.percent.cutoff ]


    low.y <- c( low.y , first.percentile )
}

# plot your original points
plot( x , y , main = 'one bin per hundred points' )

# RE-plot the points that had the second-lowest y value within each "bin"
# so you can see exactly what line you're best-fitting
points( low.x , low.y , col = "red" , pch = 19 )

# draw your line of best fit
abline( lm( low.y ~ low.x ) )

Here is the result:

[image: the original points, with the within-bin 1% points in red and the fitted floor line]

Note that the result is sensitive to the size of each bin.
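To see that sensitivity, one option is to wrap the binning in a helper and vary the number of points per bin. The `floor.line` function below is my own compact re-implementation of the loop above, not part of the original answer, and the data are regenerated with the same structure as the example:

```r
# regenerate data with the same eight-cluster structure as the example above
set.seed( 1 )
x <- rnorm( 8000 , mean = rep( c( 1 , 2 , 3 , 4 , 5 , 1 , 1 , 3 ) , each = 1000 ) )
y <- rnorm( 8000 , mean = rep( c( 1 , 2 , 3 , 4 , 5 , 5 , 3 , 5 ) , each = 1000 ) )

# compact re-implementation of the binning procedure
floor.line <- function( x , y , points.per.bin ){
    cuts <- quantile( x , seq( 0 , 1 , length.out = length( x ) / points.per.bin ) )
    mids <- head( cuts , -1 ) + diff( cuts ) / 2          # bin midpoints
    lows <- sapply( 2:length( cuts ) , function( i ){     # 1% point per bin
        in.bin <- y[ cuts[ i - 1 ] <= x & x < cuts[ i ] ]
        sort( in.bin )[ max( 1 , round( 0.01 * length( in.bin ) ) ) ]
    } )
    lm( lows ~ mids )
}

# the fitted floor line moves as the bin size changes:
coef( floor.line( x , y , 100 ) )   # one bin per hundred points
coef( floor.line( x , y , 500 ) )   # one bin per five hundred points
```

Fewer, wider bins smooth out the floor but can miss local structure; many narrow bins make each bin's 1% point noisier.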

[image: the same plot produced with a different bin size]