机器算法验证 - 当每个点在两者中都有自己的不确定性时的回归Xx和是的y - 吾爱随笔录

当每个点在两者中都有自己的不确定性时的回归Xx和是的y

机器算法验证 r 回归德明回归

2022-02-07 13:59:54

我做了 $n$ 两个变量的测量 $x$ 和 $y$ . 他们都知道不确定性 $\sigma_x$ 和 $\sigma_y$ 与他们相关联。我想找到之间的关系 $x$ 和 $y$ . 我该怎么做？

编辑：每个 $x_i$ 有不同的 $\sigma_{x,i}$ 与它相关联，与 $y_i$ .

可重现的 R 示例：

## pick some real x and y values 
true_x <- 1:100
true_y <- 2*true_x+1

## pick the uncertainty on them
sigma_x <- runif(length(true_x), 1, 10) # 10
sigma_y <- runif(length(true_y), 1, 15) # 15

## perturb both x and y with noise 
noisy_x <- rnorm(length(true_x), true_x, sigma_x)
noisy_y <- rnorm(length(true_y), true_y, sigma_y)

## make a plot 
plot(NA, xlab="x", ylab="y",
    xlim=range(noisy_x-sigma_x, noisy_x+sigma_x), 
    ylim=range(noisy_y-sigma_y, noisy_y+sigma_y))
arrows(noisy_x, noisy_y-sigma_y, 
       noisy_x, noisy_y+sigma_y, 
       length=0, angle=90, code=3, col="darkgray")
arrows(noisy_x-sigma_x, noisy_y,
       noisy_x+sigma_x, noisy_y,
       length=0, angle=90, code=3, col="darkgray")
points(noisy_y ~ noisy_x)

## fit a line 
mdl <- lm(noisy_y ~ noisy_x)
abline(mdl)

## show confidence interval around line 
newXs <- seq(-100, 200, 1)
prd <- predict(mdl, newdata=data.frame(noisy_x=newXs), 
    interval=c('confidence'), level=0.99, type='response')
lines(newXs, prd[,2], col='black', lty=3)
lines(newXs, prd[,3], col='black', lty=3)

这个例子的问题是我认为它假设没有不确定性 $x$ . 我怎样才能解决这个问题？

2个回答

让真线 $L$ , 由角度给出 $\theta$ 和一个值 $\gamma$ , 成为集合

(x, y) : \cos (θ) x + \sin (θ) y = γ .

$(x,y): \cos(\theta) x + \sin(\theta) y = \gamma.$

任意点之间的有符号距离 $(x,y)$ 这条线是

d (x, y; L) = \cos (θ) x + \sin (θ) y - γ .

$d(x,y;L) = \cos(\theta) x + \sin(\theta) y - \gamma.$

让方差为 $x_i$ 是 $\sigma_i^2$ 和那个 $y_i$ 是 $\tau_i^2$ , 独立性 $x_i$ 和 $y_i$ 意味着这个距离的方差是

Var (d (x_{i}, y_{i}; L)) = \cos^{2} (θ) σ_{i}^{2} + \sin^{2} (θ) τ_{i}^{2} .

$\operatorname{Var}(d(x_i,y_i;L)) = \cos^2(\theta)\sigma_i^2 + \sin^2(\theta)\tau_i^2.$

因此让我们找到 $\theta$ 和 $\gamma$ 其中方差加权平方和尽可能小：如果我们假设误差具有二元正态分布，它将是最大似然解。这需要一个数值解，但很容易通过几个 Newton-Raphson 步骤找到一个以普通最小二乘拟合建议的值开头的 a。

模拟表明，即使数据量较小且 $\sigma_i$ 和 $\tau_i$ . 当然，您可以通过通常的方式获得参数的标准误差。如果您对直线位置的标准误差以及斜率感兴趣，那么您可能希望首先将两个变量居中 $0$ ：这应该消除两个参数估计之间的几乎所有相关性。

该方法与问题示例的效果非常好，以至于拟合线几乎可以与图中的真实线区分开来：它们在任何地方都在一个单位左右的范围内。相反，在这个例子中 $\tau_i$ 是从指数分布中绘制的，并且 $\sigma_i$ 是从具有两倍尺度的指数分布中绘制出来的（因此大部分误差往往发生在 $x$ 协调）。只有 $n=8$ 分，数量不多。真实点沿线等距分布，单位间距。这是一个相当严格的测试，因为与点的范围相比，潜在的错误是显而易见的。

真线以蓝色虚线显示。沿着它，原始点被绘制为空心圆圈。灰色箭头将它们连接到观察点，绘制为实心黑色圆盘。解决方案绘制为红色实线。尽管观测值和实际值之间存在很大偏差，但该解决方案非常接近该区域内的正确线。

#
# Generate data.
#
theta <- c(1, -2, 3) # The line is theta %*% c(x,y,-1) == 0
theta[-3] <- theta[-3]/sqrt(crossprod(theta[-3]))
n <- 8
set.seed(17)
sigma <- rexp(n, 1/2)
tau <- rexp(n, 1)
u <- 1:n
xy.0 <- t(outer(c(-theta[2], theta[1]), 0:(n-1)) + c(theta[3]/theta[1], 0))
xy <- xy.0 + cbind(rnorm(n, sd=sigma), rnorm(n, sd=tau))
#
# Fit a line.
#
x <- xy[, 1]
y <- xy[, 2]
f <- function(phi) { # Negative log likelihood, up to an additive constant
  a <- phi[1]
  gamma <- phi[2]
  sum((x*cos(a) + y*sin(a) - gamma)^2 / ((sigma*cos(a))^2 + (tau*sin(a))^2))/2
}
fit <- lm(y ~ x) # Yields starting estimates
slope <- coef(fit)[2]
theta.0 <- atan2(1, -slope)
gamma.0 <- coef(fit)[1] / sqrt(1 + slope^2)
sol <- nlm(f,c(theta.0, gamma.0))
#
# Plot the data and the fit.
#
theta.hat <- sol$estimate[1] %% (2*pi)
gamma.hat <- sol$estimate[2]
plot(rbind(xy.0, xy), type="n", xlab="x", ylab="y")
invisible(sapply(1:n, function(i) 
  arrows(xy.0[i,1], xy.0[i,2], xy[i,1], xy[i,2], 
         length=0.15, angle=20, col="Gray")))
points(xy.0)
points(xy, pch=16)
abline(c(theta[3] / theta[2], -theta[1]/theta[2]), col="Blue", lwd=2, lty=3)
abline(c(gamma.hat / sin(theta.hat), -1/tan(theta.hat)), col="Red", lwd=2)

York (2004) 已经解决了 x 和 y 不确定情况的最大似然优化问题。这是他的功能的R代码。

“YorkFit”，由 Rick Wehr 撰写，2011 年，由 Rachel Chang 翻译成 R

通用例程，用于查找具有可变相关误差（包括误差和拟合优度估计）的数据的最佳直线拟合，遵循方程式。(13) York 2004, American Journal of Physics，该期刊又基于 York 1969, Earth and Planetary Sciences Letters

YorkFit <- 函数（X，Y，Xstd，Ystd，Ri=0，b0=0，printCoefs=0，makeLine=0，eps=1e-7）

X、Y、Xstd、Ystd：包含 X 点、Y 点及其标准差的波

警告：Xstd 和 Ystd 不能为零，因为这将导致 Xw 或 Yw 为 NaN。请改用非常小的值。

Ri：X 和 Y 误差的相关系数——长度 1 或 X 和 Y 的长度

b0：斜率的粗略初始猜测（可以从标准最小二乘拟合中得到，没有错误）

printCoefs：设置等于 1 以在命令窗口中显示结果

makeLine：设置等于 1 为拟合线生成 Y 波

返回一个包含截距和斜率以及它们的不确定性的矩阵

如果没有提供 b0 的初始猜测，则只需使用 OLS if (b0 == 0) {b0 = lm(Y~X)$coefficients[2]}

tol = abs(b0)*eps #the fit will stop iterating when the slope converges to within this value

a,b: 最终截距和斜率 a.err, b.err: 截距和斜率的估计不确定性

# WAVE DEFINITIONS #

Xw = 1/(Xstd^2) #X weights
Yw = 1/(Ystd^2) #Y weights


# ITERATIVE CALCULATION OF SLOPE AND INTERCEPT #

b = b0
b.diff = tol + 1
while(b.diff>tol)
{
    b.old = b
    alpha.i = sqrt(Xw*Yw)
    Wi = (Xw*Yw)/((b^2)*Yw + Xw - 2*b*Ri*alpha.i)
    WiX = Wi*X
    WiY = Wi*Y
    sumWiX = sum(WiX, na.rm = TRUE)
    sumWiY = sum(WiY, na.rm = TRUE)
    sumWi = sum(Wi, na.rm = TRUE)
    Xbar = sumWiX/sumWi
    Ybar = sumWiY/sumWi
    Ui = X - Xbar
    Vi = Y - Ybar

    Bi = Wi*((Ui/Yw) + (b*Vi/Xw) - (b*Ui+Vi)*Ri/alpha.i)
    wTOPint = Bi*Wi*Vi
    wBOTint = Bi*Wi*Ui
    sumTOP = sum(wTOPint, na.rm=TRUE)
    sumBOT = sum(wBOTint, na.rm=TRUE)
    b = sumTOP/sumBOT

    b.diff = abs(b-b.old)
  }     

   a = Ybar - b*Xbar
   wYorkFitCoefs = c(a,b)

# ERROR CALCULATION #

Xadj = Xbar + Bi
WiXadj = Wi*Xadj
sumWiXadj = sum(WiXadj, na.rm=TRUE)
Xadjbar = sumWiXadj/sumWi
Uadj = Xadj - Xadjbar
wErrorTerm = Wi*Uadj*Uadj
errorSum = sum(wErrorTerm, na.rm=TRUE)
b.err = sqrt(1/errorSum)
a.err = sqrt((1/sumWi) + (Xadjbar^2)*(b.err^2))
wYorkFitErrors = c(a.err,b.err)

# GOODNESS OF FIT CALCULATION #
lgth = length(X)
wSint = Wi*(Y - b*X - a)^2
sumSint = sum(wSint, na.rm=TRUE)
wYorkGOF = c(sumSint/(lgth-2),sqrt(2/(lgth-2))) #GOF (should equal 1 if assumptions are valid), #standard error in GOF

# OPTIONAL OUTPUTS #

if(printCoefs==1)
 {
    print(paste("intercept = ", a, " +/- ", a.err, sep=""))
    print(paste("slope = ", b, " +/- ", b.err, sep=""))
  }
if(makeLine==1)
 {
    wYorkFitLine = a + b*X
  }
 ans=rbind(c(a,a.err),c(b, b.err)); dimnames(ans)=list(c("Int","Slope"),c("Value","Sigma"))
return(ans)
 }

其它你可能感兴趣的问题

上一篇ACF 函数的置信区间是如何计算的？下一篇神经网络（例如，卷积神经网络）可以有负权重吗？