我正在尝试从 Elements of Statistical Learning 重新创建图 3.6。关于该图的唯一信息包含在标题中。
要重新创建前向逐步线,我的过程如下:
50次重复:
- 按照描述生成数据
- 应用前向逐步回归(通过 AIC)31 次以添加变量
- 计算每个与其对应的之间的绝对差并存储结果
这给我留下了矩阵,我可以在该矩阵上计算列的平均值以生成绘图。
上述方法是不正确的,但我不清楚它到底应该是什么。我相信我的问题在于对 Y 轴上的均方误差的解释。y轴上的公式到底是什么意思?它只是被比较的第k个beta吗?
参考代码
生成数据:
library('MASS')
library('stats')
library('MLmetrics')
# generate the data
generate_data <- function(r, p, samples){
corr_matrix <- suppressWarnings(matrix(c(1,rep(r,p)), nrow = p, ncol = p)) # ignore warning
mean_vector <- rep(0,p)
data = mvrnorm(n=samples, mu=mean_vector, Sigma=corr_matrix, empirical=TRUE)
coefficients_ <- rnorm(10, mean = 0, sd = 0.4) # 10 non zero coefficients
names(coefficients_) <- paste0('X', 1:10)
data_1 <- t(t(data[,1:10]) * coefficients_) # coefs by first 10 columns
Y <- rowSums(data_1) + rnorm(samples, mean = 0, sd = 6.25) # adding gaussian noise
return(list(data, Y, coefficients_))
}
应用正向逐步回归 50 次:
r <- 0.85
p <- 31
samples <- 300
# forward stepwise
error <- data.frame()
for(i in 1:50){ # i = 50 repititions
output <- generate_data(r, p, samples)
data <- output[[1]]
Y <- output[[2]]
coefficients_ <- output[[3]]
biggest <- formula(lm(Y~., data.frame(data)))
current_model <- 'Y ~ 1'
fit <- lm(as.formula(current_model), data.frame(data))
for(j in 1:31){ # j = 31 variables
# find best variable to add via AIC
new_term <- addterm(fit, scope = biggest)[-1,]
new_var <- row.names(new_term)[min(new_term$AIC) == new_term$AIC]
# add it to the model and fit
current_model <- paste(current_model, '+', new_var)
fit <- lm(as.formula(current_model), data.frame(data))
# jth beta hat
beta_hat <- unname(tail(fit$coefficients, n = 1))
new_var_name <- names(tail(fit$coefficients, n = 1))
# find corresponding beta
if (new_var_name %in% names(coefficients_)){
beta <- coefficients_[new_var_name]
}
else{beta <- 0}
# store difference between the two
diff <- beta_hat - beta
error[i,j] <- diff
}
}
# plot output
vals <-apply(error, 2, function(x) mean(x**2))
plot(vals) # not correct
输出: