机器算法验证 - 自举是否适合使用小样本量估计多元正态协方差矩阵？ - 吾爱随笔录

自举是否适合使用小样本量估计多元正态协方差矩阵？

机器算法验证 r 数理统计引导程序协方差

2022-04-04 06:23:00

设是一个矩阵，其中每一行表示对 4 变量正态分布随机变量的观察。 $p$ $\mathcal{N}_4(\mu,\Sigma)$

是否有任何 Bootstrap 方法可以很好地估计 Σ？
如果不是，以下样本的数量是否足以引导任何统计量 T 的分布，该统计量对 Y 来自的总体进行操作？

这是我到目前为止所尝试的：

Y<-data.frame(response=c(10,19,27,28,9,13,25,29,4,10,20,18,5,6,12,17),
               treatment=factor(rep(1:4,4)),
               subject=factor(rep(1:4,each=4))
               )

p<-matrix(Y$response,4,4,byrow=T)
B<-1000
sampleB<-sample(1:4,4*B,replace=T)
fit<-lm(p[sampleB,]~1)
cov(residuals(fit))

我也试过

require(nlme)
require(mgcv)
nSubj <- 20
sampleB<-sample(1:4,nSubj,replace=T)
y<-data.frame(response=c(t(p[sampleB,])),
           treatment=factor(rep(1:4,nSubj)),
           subject=factor(rep(1:nSubj,each=4))
           )

fit <- lme(response~-1+treatment,y,random=~1|subject,correlation=corSymm())
extract.lme.cov(fit,y)[1:4,1:4]

但我得到错误代码：

Error in lme.formula(response ~ -1 + treatment, y, random = ~1 | subject,  : 
nlminb problem, convergence error code = 1
message = iteration limit reached without convergence (10)

1个回答

的好选择吗？ $\Sigma$

不，boostrap 将帮助您推断样本估计的不确定性。具体来说，它可用于获取元素的置信区间。 $\widehat{\Sigma}$

引导程序如何工作？

该方法是通过对带有替换的观察结果进行重新采样，从原始数据集创建然后你计算每个复制的兴趣估计，在你的情况下，协方差矩阵，对于每个复制，获得。的置信区间可以根据经验从中计算出来。 $R$ $R$ $R$ $\widehat{\Sigma}^1, \ldots, \widehat{\Sigma}^R$ $\widehat{\Sigma}$ $\widehat{\Sigma}^1, \ldots, \widehat{\Sigma}^R$

有关更多信息，您可能需要查看Wikipedia 页面。

编辑

我可以使用而不是吗？ $\widetilde{\Sigma}_R = R^{-1} \sum_{r=1}^R \widehat{\Sigma}^r$ $\widehat{\Sigma}$

实际上，矩阵是对的估计，其中是基于初始样本的引导复制。自举归结为经验分布的样本。因此，我认为，但我没有正式的证明，即 as。 $\widetilde{\Sigma}_R$ ${\rm E} (\widehat{\Sigma}^r)$ $\widehat{\Sigma}^r$ $\widehat{F}$ $\widetilde{\Sigma}_R = {\rm E} (\widehat{\Sigma}^r) \to \widehat{\Sigma}$ $R \to \infty$

所以，我认为你可以使用 $\widetilde{\Sigma}_R$ 代替 $\widehat{\Sigma}$ ，但这就像用大锤敲碎坚果一样。

下面的 R 代码是一个数值研究，它不能证明上述关于收敛的断言。依赖结构是 Gumbel copula，边距是两个标准正态分布。

## Initialization
library(copula)
set.seed(531)
n <- 200            # Number of observations in the original sample
R <- 10000          # Number of replications
## Specification for the dependence structure (Gumbel copula)
spec.cl <- archmCopula("gumbel", 1.2)
## Create a fake original dataset
pseudo  <- rCopula(n, spec.cl)
obs     <- qnorm(pseudo)
cov.obs <- cov(obs)[1, 2]
## Get an idea of the "true" covariance
pseudo   <- rCopula(10000, spec.cl)
obs.big  <- qnorm(pseudo)
cov.true <- cov(obs.big)[1, 2]
## Get the bootstrap covariances
cov.sim <- sapply(1:R,
                  function(i, x, n){x.boot <- x[sample(1:n, size = n, replace = TRUE), ]
                                    cov(x.boot)[1, 2]},
                  x = obs, n = n)
## Visualization
plot(1:R, cov.sim, xlab = "Replication", ylab = "", pch = 16, cex = 0.7, col ="grey",
     ylim = quantile(cov.sim, probs = c(0.1, 0.9)))
lines(1:R, rep(cov.true, R), col = "green", lwd = 2)
lines(1:R, rep(cov.obs, R), col = "red", lwd = 2)
lines(1:R, cumsum(cov.sim)/(1:R), col = "blue", lwd = 2)
legend("topright", legend = c("Boot cov", "True", "Initial", "Boot average"),
       col = c("grey", "green", "red", "blue"),
       bg = "white", pch = c(16, NA, NA, NA), lwd = c(NA, 2, 2, 2))

蓝线对应 $\widetilde{\Sigma}_R$ 作为复制次数的函数，红线是 $\widehat{\Sigma}$ .

在此处输入图像描述

其它你可能感兴趣的问题

上一篇数据严重偏向零时的两个样本位置测试？下一篇为什么系数大小的正则化改善了线性回归的泛化？