How can I robustly identify a floor trendline while ignoring outliers?

linear-model · algorithms · outliers · trend
2022-03-20 17:22:24

I have the following dataset (the x and y scales are the same on every chart):

[image: the example scatterplots]

On each chart the eye picks out an upward-sloping floor trendline, running up from left to right and hugging the bottom of the main mass of blue points, with a few outliers below it. The trendline differs from chart to chart.

This is easy for the eye to discern. But is there a robust algorithm for doing it?

I have considered writing an objective function that computes each point's residual from a candidate trendline, discards (say) the 1% of points furthest below the line, and then weights the remaining residuals so that points above the line contribute little to the objective and points below it contribute a lot, making the line hug the bottom of the main mass of points. Then find the trendline that minimizes that objective.
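As a rough sketch of that objective (nothing here is from the question itself: the synthetic data, the 99:1 weight ratio, and the use of R's `optim` are all my own assumptions), the idea can be written down directly:

```r
# minimal sketch of the proposed objective: drop the lowest 1% of
# residuals, then weight below-line residuals 99x more than above-line ones
set.seed( 1 )
x <- runif( 1000 , 0 , 10 )
y <- 1 + 0.5 * x + rexp( 1000 )   # mass of points above a floor of y = 1 + 0.5x
y[ 1:10 ] <- y[ 1:10 ] - 5        # a few outliers below the floor

floor.loss <- function( par , x , y , below.weight = 99 , drop = 0.01 ){
    r <- y - ( par[ 1 ] + par[ 2 ] * x )                         # signed residuals
    r <- sort( r )[ -seq_len( ceiling( drop * length( r ) ) ) ]  # discard lowest 1%
    sum( ifelse( r >= 0 , r , -below.weight * r ) )              # asymmetric penalty
}

fit <- optim( c( 0 , 0 ) , floor.loss , x = x , y = y )
fit$par   # estimated intercept and slope of the floor line
```

Note that without the discard step this asymmetric loss is exactly the check loss of quantile regression: a below:above weight ratio of w:1 targets the 1/(1+w) conditional quantile.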

Is there a more robust approach?

3 Answers

I think linear quantile regression is close to what you want. It fits a line so that, for each x value, the predicted value is close to a chosen quantile of the response conditional on x.

Here is an R package: http://cran.r-project.org/web/packages/quantreg/index.html

For example, you could try the 1% quantile and see whether it avoids the outliers. You can adjust the chosen quantile until it looks right.
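A minimal sketch of that fit (assuming the `quantreg` package is installed; the simulated data are illustrative, not the questioner's):

```r
library( quantreg )

# simulate points piling up above a floor of y = 1 + 0.5x
set.seed( 1 )
x <- runif( 1000 , 0 , 10 )
y <- 1 + 0.5 * x + rexp( 1000 )
y[ 1:5 ] <- y[ 1:5 ] - 5        # a few outliers below the floor

# fit the 1% conditional quantile -- the "floor" line
fit <- rq( y ~ x , tau = 0.01 )
coef( fit )

plot( x , y )
abline( fit , col = "red" )
```

Raising `tau` pulls the line up into the cloud; lowering it pushes the line down toward (and eventually through) the outliers.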

If you want a more principled way to decide where the outliers begin, I think you need to make more assumptions about the distribution of the data.

Is there only one trendline? Probably not. You have outlying points below the "visual floor trendline". How are these to be "ignored" so that the main floor trend the eye sees is captured without being influenced by them? To detect the trend while damping their influence, we want to detect the trendline and the pulses that are inconsistent with it simultaneously.

If you can reduce the x-y observations as you described and then post the reduced set, I could try the only commercial software I know of that handles trend detection while accounting for ARIMA structure and pulses. As far as I know nothing free is available, and of course heuristics that replicate the human eye are not publicly disclosed.

Perhaps try running this custom code in R. If you have never used R, check out http://twotorials.com/ to get started ;)

# create eight sets of a thousand random x values
x1 <- rnorm( 1000 , mean = 1 )
x2 <- rnorm( 1000 , mean = 2 )
x3 <- rnorm( 1000 , mean = 3 )
x4 <- rnorm( 1000 , mean = 4 )
x5 <- rnorm( 1000 , mean = 5 )
x6 <- rnorm( 1000 , mean = 1 )
x7 <- rnorm( 1000 , mean = 1 )
x8 <- rnorm( 1000 , mean = 3 )

# create eight sets of a thousand random y values
y1 <- rnorm( 1000 , mean = 1 )
y2 <- rnorm( 1000 , mean = 2 )
y3 <- rnorm( 1000 , mean = 3 )
y4 <- rnorm( 1000 , mean = 4 )
y5 <- rnorm( 1000 , mean = 5 )
y6 <- rnorm( 1000 , mean = 5 )
y7 <- rnorm( 1000 , mean = 3 )
y8 <- rnorm( 1000 , mean = 5 )

# combine all of these values into two vectors
x <- c( x1 , x2 , x3 , x4 , x5 , x6 , x7 , x8 )
y <- c( y1 , y2 , y3, y4 , y5 , y6 , y7 , y8 )

# this distribution looks like your example distribution
plot( x , y )

# along the x axis, figure out some reasonable intervals to "bin" the data

# let's say you want one bin per 100 points..
num.bins <- length( x ) / 100

# figure out what quantiles to cut your bins at
quantile.probs <- seq( 0 , 1 , length.out = num.bins )

# slice up your `x` data into that many equal bins
bin.cutpoints <- quantile( x , quantile.probs )

# now let's look at just the first bin #

# positions within the first bin
first.bin <- which( bin.cutpoints [ 1 ] <= x & x < bin.cutpoints [ 2 ] )

# x midpoint between first two cutpoints
first.midpoint <- as.numeric( bin.cutpoints [ 1 ] + ( bin.cutpoints [ 2 ] - bin.cutpoints [ 1 ] ) / 2 )

first.midpoint

# since you wanted to discard 1% of all points, choose the 1% quantile cutoff point
one.percent.cutoff <- round( quantile( 1:length( y[ first.bin ] ) , 0.01 ) )

# find the point at the edge of the first percentile within this bin
first.percentile <- sort( y[ first.bin ] )[ one.percent.cutoff ]

# and there's your `y` value
first.percentile


# create two empty vectors to start storing values
low.x <- NULL
low.y <- NULL

# repeat this process for all bins:
for ( i in 2:length( bin.cutpoints ) ){

    this.bin <- which( bin.cutpoints [ i - 1 ] <= x & x < bin.cutpoints [ i ] )

    this.midpoint <- as.numeric( bin.cutpoints [ i - 1 ] + ( bin.cutpoints [ i ] - bin.cutpoints [ i - 1 ] ) / 2 )

    low.x <- c( low.x , this.midpoint )

    # since you wanted to discard 1% of all points, choose the 1% quantile cutoff point
    one.percent.cutoff <- round( quantile( 1:length( y[ this.bin ] ) , 0.01 ) )

    # find the point at the edge of the first percentile within this bin
    first.percentile <- sort( y[ this.bin ] )[ one.percent.cutoff ]


    low.y <- c( low.y , first.percentile )
}

# plot your original points
plot( x , y , main = 'one bin per hundred points' )

# RE-plot the points that had the second-lowest y value within each "bin"
# so you can see exactly what line you're best-fitting
points( low.x , low.y , col = "red" , pch = 19 )

# draw your line of best fit
abline( lm( low.y ~ low.x ) )

Here is the result:

[image: the original points, with the within-bin 1% points in red and the fitted floor line]

Note that the result is sensitive to the size of each bin.
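To see that sensitivity, one option is to wrap the binning in a helper and vary the number of points per bin. The `floor.line` function below is my own compact re-implementation of the loop above, not part of the original answer, and the data are regenerated with the same structure as the example:

```r
# regenerate data with the same eight-cluster structure as the example above
set.seed( 1 )
x <- rnorm( 8000 , mean = rep( c( 1 , 2 , 3 , 4 , 5 , 1 , 1 , 3 ) , each = 1000 ) )
y <- rnorm( 8000 , mean = rep( c( 1 , 2 , 3 , 4 , 5 , 5 , 3 , 5 ) , each = 1000 ) )

# compact re-implementation of the binning procedure
floor.line <- function( x , y , points.per.bin ){
    cuts <- quantile( x , seq( 0 , 1 , length.out = length( x ) / points.per.bin ) )
    mids <- head( cuts , -1 ) + diff( cuts ) / 2          # bin midpoints
    lows <- sapply( 2:length( cuts ) , function( i ){     # 1% point per bin
        in.bin <- y[ cuts[ i - 1 ] <= x & x < cuts[ i ] ]
        sort( in.bin )[ max( 1 , round( 0.01 * length( in.bin ) ) ) ]
    } )
    lm( lows ~ mids )
}

# the fitted floor line moves as the bin size changes:
coef( floor.line( x , y , 100 ) )   # one bin per hundred points
coef( floor.line( x , y , 500 ) )   # one bin per five hundred points
```

Fewer, wider bins smooth out the floor but can miss local structure; many narrow bins make each bin's 1% point noisier.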

[image: the same plot produced with a different bin size]