我经常看到人们(统计学家和从业者)不假思索地转换变量。我一直害怕转换,因为我担心它们会修改错误分布,从而导致无效的推理,但它很常见,以至于我不得不误解一些东西。
为了修正想法,假设我有一个模型
原则上这可能适合 NLS。然而,我几乎总是看到人们记录日志,并安装
我知道这可以通过 OLS 拟合,但我现在不知道如何计算参数的置信区间,更不用说预测区间或公差区间了。
这是一个非常简单的案例:考虑这个相当复杂的(对我而言)案例,其中我不假设和先验关系的形式,但我尝试从数据中推断它,例如,一个GAM。让我们考虑以下数据:
library(readr)
library(dplyr)
library(ggplot2)
# data
device <- structure(list(Amplification = c(1.00644, 1.00861, 1.00936, 1.00944,
1.01111, 1.01291, 1.01369, 1.01552, 1.01963, 1.02396, 1.03016,
1.03911, 1.04861, 1.0753, 1.11572, 1.1728, 1.2512, 1.35919, 1.50447,
1.69446, 1.94737, 2.26728, 2.66248, 3.14672, 3.74638, 4.48604,
5.40735, 6.56322, 8.01865, 9.8788, 12.2692, 15.3878, 19.535,
20.5192, 21.5678, 22.6852, 23.8745, 25.1438, 26.5022, 27.9537,
29.5101, 31.184, 32.9943, 34.9456, 37.0535, 39.325, 41.7975,
44.5037, 47.466, 50.7181, 54.2794, 58.2247, 62.6346, 67.5392,
73.0477, 79.2657, 86.3285, 94.4213, 103.781, 114.723, 127.637,
143.129, 162.01, 185.551, 215.704, 255.635, 310.876, 392.231,
523.313, 768.967, 1388.19, 4882.47), Voltage = c(34.7732, 24.7936,
39.7788, 44.7776, 49.7758, 54.7784, 64.778, 74.775, 79.7739,
84.7738, 89.7723, 94.772, 99.772, 109.774, 119.777, 129.784,
139.789, 149.79, 159.784, 169.772, 179.758, 189.749, 199.743,
209.736, 219.749, 229.755, 239.762, 249.766, 259.771, 269.775,
279.778, 289.781, 299.783, 301.783, 303.782, 305.781, 307.781,
309.781, 311.781, 313.781, 315.78, 317.781, 319.78, 321.78, 323.78,
325.78, 327.779, 329.78, 331.78, 333.781, 335.773, 337.774, 339.781,
341.783, 343.783, 345.783, 347.783, 349.785, 351.785, 353.786,
355.786, 357.787, 359.786, 361.787, 363.787, 365.788, 367.79,
369.792, 371.792, 373.794, 375.797, 377.8)), .Names = c("Amplification",
"Voltage"), row.names = c(NA, -72L), class = "data.frame")
的情况下绘制数据,则生成的模型和置信范围看起来不太好:
# build model
model <- gam(Voltage ~ s(Amplification, sp = 0.001), data = device)
# compute predictions with standard errors and rename columns to make plotting simpler
Amplifications <- data.frame(Amplification = seq(min(APD_data$Amplification),
max(APD_data$Amplification), length.out = 500))
predictions <- predict.gam(model, Amplifications, se.fit = TRUE)
predictions <- cbind(Amplifications, predictions)
predictions <- rename(predictions, Voltage = fit)
# plot data, model and standard errors
ggplot(device, aes(x = Amplification, y = Voltage)) +
geom_point() +
geom_ribbon(data = predictions,
aes(ymin = Voltage - 1.96*se.fit, ymax = Voltage + 1.96*se.fit),
fill = "grey70", alpha = 0.5) +
geom_line(data = predictions, color = "blue")
但是,如果我只对上的置信范围似乎变得更小:
log_model <- gam(Voltage ~ s(log(Amplification)), data = device)
# the rest of the code stays the same, except for log_model in place of model
很明显,有些可疑的事情正在发生。这些置信区间可靠吗? 编辑这不仅仅是平滑程度的问题,正如答案中所建议的那样。没有对数变换,平滑参数为
> model$sp
s(Amplification)
5.03049e-07
使用对数变换,平滑参数确实要大得多:
>log_model$sp
s(log(Amplification))
0.0005156608
但这不是置信区间变得如此小的原因。事实上,使用更大的平滑参数sp = 0.001
,但避免任何对数变换,可以减少振荡(如在对数变换的情况下),但相对于对数变换的情况,标准误差仍然很大:
smooth_model <- gam(Voltage ~ s(Amplification, sp = 0.001), data = device)
# the rest of the code stays the same, except for smooth_model in place of model
一般来说,如果我记录变换和/或,置信区间会发生什么?如果在一般情况下无法定量回答,我将接受第一种情况(指数模型)的定量答案(即,它显示一个公式),并为第二种情况至少给出一个挥手的论据(GAM 模型)。