机器算法验证 - 为什么线性混合模型中观察值的增加会导致贝叶斯建模方法失败？ - 吾爱随笔录

我对贝叶斯建模背后的理论有相当好的理解，并且我已经开始尝试使用jagsR 进行一些实际的建模。我一直在关注在线示例以习惯语法。然而在下面的例子中，我只是增加了观察的数量，它产生了意想不到的效果，导致结果出现偏差。我试图找出原因。

我一直在使用这里的资源。在第三个示例中，他们尝试创建一个线性混合模型：b_mass ~ b_length + r.e(site). （一条蛇的）体重与连续可变长度呈线性关系，并且在发现蛇的位置上存在随机截距。

问题：
1.为什么当观察次数增加时，贝叶斯模型表现如此糟糕？

2.在混合模型中，观察值与随机效应数量的比率是否重要？在常客框架下拟合标准线性混合效应模型时似乎并非如此。

3.带有大量异常值/长尾的迹线图表示什么？又应该对他们进行哪些调整？

细节：

场景（A）：
生成数据并拟合模型的代码，直接取自上面的网站：

## First create the data
set.seed(42)

samplesize <- 200
nsites <- 10 # Number of sites
b_length <- sort(rnorm(samplesize)) # Body length (explanatory variable)
sites <- sample(1:10, samplesize, replace = T) # Site (grouping variable)
table(sites)

int_true_mean <- 45 # True mean intercept
int_true_sigma <- 10 # True standard deviation of random intercepts
int_true_sites <- rnorm(n = nsites, mean = int_true_mean, sd = int_true_sigma) # True intercept of each site

# Intercept of each snake individual (depending on the site where it was captured):
sitemat <- matrix(0, nrow = samplesize, ncol = nsites)
for (i in 1:nrow(sitemat)) sitemat[i, sites[i]] <- 1
int_true <- sitemat %*% int_true_sites

slope_true <- 10 # True slope (coefficient of length)
mu <- int_true + slope_true * b_length # True means of normal distributions
sigma <- 5 # True standard deviation of normal distribution (random variation)

b_mass <- rnorm(samplesize, mean = mu, sd = sigma) # Body mass (response variable)

snakes3 <- data.frame(b_length = b_length, b_mass = b_mass, site = sites)
head(snakes3)

## Create list for jags to work on, include number of sites as an additional element in the list:
Nsites <- length(levels(as.factor(snakes3$site)))
jagsdata_s3 <- with(snakes3, list(b_mass = b_mass, b_length = b_length, site = site, N = length(b_mass), Nsites = Nsites))

## Define model

lm3_jags <- function(){
  # Likelihood:
  for (i in 1:N){
    b_mass[i] ~ dnorm(mu[i], tau) # tau is precision (1 / variance)
    mu[i] <- alpha + a[site[i]] + beta * b_length[i] # Random intercept for     site
  }
  # Priors:
  alpha ~ dnorm(0, 0.01) # intercept
  sigma_a ~ dunif(0, 100) # standard deviation of random effect (variance between sites)
  tau_a <- 1 / (sigma_a * sigma_a) # convert to precision
  for (j in 1:Nsites){
    a[j] ~ dnorm(0, tau_a) # random intercept for each site
  }
  beta ~ dnorm(0, 0.01) # slope
  sigma ~ dunif(0, 100) # standard deviation of fixed effect (variance within sites)
  tau <- 1 / (sigma * sigma) # convert to precision
}

## Define initial values
init_values <- function(){
  list(alpha = rnorm(1), sigma_a = runif(1), beta = rnorm(1), sigma = runif(1))
}

# Define parameters to kept in results
params <- c("alpha", "beta", "sigma", "sigma_a")

# Fit the model
fit_lm3 <- jags(data = jagsdata_s3, inits = init_values, parameters.to.save = params, model.file = lm3_jags, n.chains = 3, n.iter = 20000, n.burnin = 5000, n.thin = 2, DIC = F)
# I have reduced n.thin to 2, so output slightly different from website

输出是：

## Output 
fit_lm3


        mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
alpha    42.583   4.481 30.875 41.144 43.476 45.208 48.110 1.004  3600
beta      9.893   0.349  9.208  9.662  9.895 10.127 10.582 1.001 19000
sigma     4.755   0.248  4.300  4.581  4.745  4.915  5.272 1.001 22000
sigma_a   8.803   4.386  4.610  6.292  7.689  9.845 19.898 1.002  2400

正确答案应该是：

alpha（平均截距）= 45。
beta （长度系数）= 10。
sigma （残差 sd）= 5。
sigma_a（截距的 sd）= 10。

结果相当不错，42.583接近45，9.893接近10，4.755接近5，8.803接近10。

场景（B）：
但是我认为获得更准确的结果会很好，所以我将观察次数增加到 1000。

samplesize <- 1000

其余代码保持不变，现在的输出是：

        mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
alpha    33.587   8.982 11.181 28.906 35.602 40.078 45.933 1.001  6100
beta      9.839   0.164  9.519  9.727  9.838  9.950 10.165 1.001 22000
sigma     5.265   0.118  5.038  5.184  5.263  5.343  5.503 1.001 22000
sigma_a  19.635   9.749  8.960 12.965 16.852 23.377 45.792 1.001  5300

长度和残差变异估计的系数仍然很好（beta 和 sigma），但截距变异的估计和截距的平均值现在非常差。

场景（C）：
我认为这可能与#observations/#random 截距的比率有关，所以我将站点数量增加到 50 个，模型再次表现良好。

但是我不确定为什么会这样。

samplesize <- 1000
nsites <- 50 # Number of sites
b_length <- sort(rnorm(samplesize)) # Body length (explanatory variable)
sites <- sample(1:50, samplesize, replace = T) # Site (grouping variable)

输出：

        mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
alpha    43.398   1.533 40.296 42.395 43.420 44.436 46.353 1.001 22000
beta      9.836   0.162  9.522  9.726  9.836  9.945 10.156 1.001 22000
sigma     5.013   0.114  4.794  4.936  5.011  5.089  5.241 1.001 22000
sigma_a  10.816   1.151  8.832 10.009 10.728 11.517 13.362 1.001 22000

我在下面附上了场景(A)的跟踪图。显然，存在一些问题，例如长尾的密度和许多“异常值”值alpha。sigma_a我想发布更多，但我的声誉不够高。场景(B)的轨迹图看起来相似（差），而场景(C)的轨迹图看起来不错。这与结果不符。

最后，我在所有这些场景中拟合了一个标准的线性混合效应模型，并且它在所有三个场景中都表现良好。

以下是场景 A 的跟踪图：

以下是场景 B 的跟踪图：

以下是场景 C 的跟踪图：

以下是样本大小 = 200 且站点数 = 40 时的跟踪图。针对以下 Tommaso 的评论。