我接受了自己的建议,并用一些模拟数据进行了尝试。这不是一个完整的答案,但它是我硬盘上的内容。
我模拟数据...
...带有重尾残差(ϵ10∼Student(ν=1))。
如果使用 拟合两个模型brms
,一个带有高斯残差,一个带有学生残差。
m_gauss = brm(y ~ x + (x|subject), data=data, file='fits/m_gauss')
m_student = brm(y ~ x + (x|subject), data=data, family=student,
file='fits/m_student')
在高斯模型中,固定效应估计是合理的,但有噪声,(见下面模拟代码中的真实参数),但是sigma
,残差的标准差估计为 60 左右。
summary(m_gauss)
# Family: gaussian
# Links: mu = identity; sigma = identity
# Formula: y ~ x + (x | subject)
# Data: data (Number of observations: 100)
# Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
# total post-warmup samples = 4000
#
# Group-Level Effects:
# ~subject (Number of levels: 5)
# Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sd(Intercept) 67.72 21.63 38.27 123.54 1.00 1445 2154
# sd(x) 21.72 7.33 11.58 40.11 1.00 1477 2117
# cor(Intercept,x) -0.18 0.32 -0.73 0.48 1.00 1608 1368
#
# Population-Level Effects:
# Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# Intercept 2.90 19.65 -35.38 43.25 1.00 1959 2204
# x -2.63 8.91 -19.36 16.19 1.00 1659 1678
#
# Family Specific Parameters:
# Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sigma 67.02 5.08 57.84 77.94 1.00 3790 3177
Student 模型给出了一致的参数估计,因为残差或多或少是对称的,但标准误差降低了。不过,我对标准误差的实际变化如此之小感到惊讶。它正确估计σ(10) 和ν(自由度:1)为残差。
summary(m_student)
# Family: student
# Links: mu = identity; sigma = identity; nu = identity
# Formula: y ~ x + (x | subject)
# Data: data (Number of observations: 100)
# Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
# total post-warmup samples = 4000
#
# Group-Level Effects:
# ~subject (Number of levels: 5)
# Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sd(Intercept) 57.54 18.26 33.76 103.21 1.00 1069 1677
# sd(x) 22.99 8.29 12.19 43.89 1.00 1292 1302
# cor(Intercept,x) -0.20 0.31 -0.73 0.45 1.00 2532 2419
#
# Population-Level Effects:
# Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# Intercept 2.26 17.61 -32.62 39.11 1.00 1733 1873
# x -3.42 9.12 -20.50 15.38 1.00 1641 1263
#
# Family Specific Parameters:
# Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sigma 10.46 1.75 7.34 14.23 1.00 2473 2692
# nu 1.24 0.18 1.01 1.68 1.00 2213 1412
有趣的是,固定效应和随机效应的标准误差的减少导致后验预测均值精度的显着提高(即回归线位置的不确定性,如下所示)。
因此,当您的数据具有重尾残差时,使用假设高斯残差的模型会夸大您的固定效应以及其他参数的标准误差。这主要是因为如果假设残差为高斯分布,则它们必须来自具有巨大标准偏差的高斯分布,因此数据被视为非常嘈杂。
使用正确指定高斯重尾性质的模型(这也是非贝叶斯稳健回归所做的)在很大程度上解决了这个问题,即使单个参数的标准误差变化不大,累积效应在估计的回归线上是相当可观的!
方差齐性
值得注意的是,即使所有残差都来自同一个分布,重尾意味着一些组会有很多异常值(例如第 4 组),而另一些则不会(例如第 2 组)。这里的两个模型都假设残差在每组中具有相同的方差。这给高斯模型带来了额外的问题,因为它被迫得出结论,即使是数据接近回归线的第 2 组,也具有较高的残差方差,因此具有更大的不确定性。换句话说,如果没有使用稳健的重尾残差分布正确建模某些组中存在的异常值,即使没有异常值的组也会增加不确定性。
代码
library(tidyverse)
library(brms)
dir.create('fits')
theme_set(theme_classic(base_size = 18))
# Simulate some data
n_subj = 5
n_trials = 20
subj_intercepts = rnorm(n_subj, 0, 50) # Varying intercepts
subj_slopes = rnorm(n_subj, 0, 20) # Varying slopes
data = data.frame(subject = rep(1:n_subj, each=n_trials),
intercept = rep(subj_intercepts, each=n_trials),
slope = rep(subj_slopes, each=n_trials)) %>%
mutate(
x = rnorm(n(), 0, 10),
yhat = intercept + x*slope)
residuals = rt(nrow(data), df=1) * 10
hist(residuals, breaks = 50)
data$y = data$yhat + residuals
ggplot(data, aes(x, y, color=factor(subject))) +
geom_point() +
stat_smooth(method='lm', se=T) +
labs(x='x', y='y', color='Group') +
geom_hline(linetype='dashed', yintercept=0)
m_gauss = brm(y ~ x + (x|subject), data=data, file='fits/m_gauss')
m_student = brm(y ~ x + (x|subject), data=data,
family=student, file='fits/m_student')
summary(m_gauss)
summary(m_student)
fixef(m_gauss)
fixef(m_student)
pred_gauss = data.frame(fitted(m_gauss))
names(pred_gauss) = paste0('gauss_', c('b', 'se', 'low', 'high'))
pred_student = data.frame(fitted(m_student))
names(pred_student) = paste0('student_', c('b', 'se', 'low', 'high'))
pred_df = cbind(data, pred_gauss, pred_student) %>%
arrange(subject, x)
ggplot(pred_df, aes(x, gauss_b,
ymin=gauss_low, ymax=gauss_high,
color=factor(subject),
fill=factor(subject))) +
geom_path() + geom_ribbon(alpha=.2) +
labs(title='Gaussian Model', color='Subject', fill='Subject',
y='Estimates')
ggplot(pred_df, aes(x, student_b,
ymin=student_low, ymax=student_high,
color=factor(subject),
fill=factor(subject))) +
geom_path() + geom_ribbon(alpha=.2) +
labs(title='Heavy-tailed (Student) Model', color='Subject', fill='Subject',
y='Estimates')