机器算法验证 - 记录因变量和/或自变量的对数是否会影响模型误差，从而影响推理的有效性？ - 吾爱随笔录

记录因变量和/或自变量的对数是否会影响模型误差，从而影响推理的有效性？

机器算法验证回归置信区间推理广义加法模型对数

2022-03-28 15:47:37

我经常看到人们（统计学家和从业者）不假思索地转换变量。我一直害怕转换，因为我担心它们会修改错误分布，从而导致无效的推理，但它很常见，以至于我不得不误解一些东西。

为了修正想法，假设我有一个模型

Y = β_{0} \exp X^{β_{1}} + ϵ, ϵ \sim N (0, σ^{2})

$Y=\beta_0\exp X^{\beta_1}+\epsilon,\ \epsilon\sim\mathcal{N}(0,\sigma^2)$

原则上这可能适合 NLS。然而，我几乎总是看到人们记录日志，并安装

\log Y = \log β_{0} + β_{1} \log X + ? ? ? \Rightarrow Z = α_{0} + β_{1} W + ? ? ?

$\log{Y}=\log\beta_0+\beta_1\log{X}+???\Rightarrow Z=\alpha_0+\beta_1W+???$

我知道这可以通过 OLS 拟合，但我现在不知道如何计算参数的置信区间，更不用说预测区间或公差区间了。

这是一个非常简单的案例：考虑这个相当复杂的（对我而言）案例，其中我不假设和先验关系的形式，但我尝试从数据中推断它，例如，一个GAM。让我们考虑以下数据： $Y$ $X$

library(readr)
library(dplyr)
library(ggplot2)

# data
device <- structure(list(Amplification = c(1.00644, 1.00861, 1.00936, 1.00944, 
1.01111, 1.01291, 1.01369, 1.01552, 1.01963, 1.02396, 1.03016, 
1.03911, 1.04861, 1.0753, 1.11572, 1.1728, 1.2512, 1.35919, 1.50447, 
1.69446, 1.94737, 2.26728, 2.66248, 3.14672, 3.74638, 4.48604, 
5.40735, 6.56322, 8.01865, 9.8788, 12.2692, 15.3878, 19.535, 
20.5192, 21.5678, 22.6852, 23.8745, 25.1438, 26.5022, 27.9537, 
29.5101, 31.184, 32.9943, 34.9456, 37.0535, 39.325, 41.7975, 
44.5037, 47.466, 50.7181, 54.2794, 58.2247, 62.6346, 67.5392, 
73.0477, 79.2657, 86.3285, 94.4213, 103.781, 114.723, 127.637, 
143.129, 162.01, 185.551, 215.704, 255.635, 310.876, 392.231, 
523.313, 768.967, 1388.19, 4882.47), Voltage = c(34.7732, 24.7936, 
39.7788, 44.7776, 49.7758, 54.7784, 64.778, 74.775, 79.7739, 
84.7738, 89.7723, 94.772, 99.772, 109.774, 119.777, 129.784, 
139.789, 149.79, 159.784, 169.772, 179.758, 189.749, 199.743, 
209.736, 219.749, 229.755, 239.762, 249.766, 259.771, 269.775, 
279.778, 289.781, 299.783, 301.783, 303.782, 305.781, 307.781, 
309.781, 311.781, 313.781, 315.78, 317.781, 319.78, 321.78, 323.78, 
325.78, 327.779, 329.78, 331.78, 333.781, 335.773, 337.774, 339.781, 
341.783, 343.783, 345.783, 347.783, 349.785, 351.785, 353.786, 
355.786, 357.787, 359.786, 361.787, 363.787, 365.788, 367.79, 
369.792, 371.792, 373.794, 375.797, 377.8)), .Names = c("Amplification", 
"Voltage"), row.names = c(NA, -72L), class = "data.frame")

的情况下绘制数据，则生成的模型和置信范围看起来不太好： $X$

# build model
model <- gam(Voltage ~ s(Amplification, sp = 0.001), data = device)

# compute predictions with standard errors and rename columns to make plotting simpler 
Amplifications <- data.frame(Amplification = seq(min(APD_data$Amplification), 
                                           max(APD_data$Amplification), length.out = 500))
predictions <- predict.gam(model, Amplifications, se.fit = TRUE)
predictions <- cbind(Amplifications, predictions)
predictions <- rename(predictions, Voltage = fit) 

# plot data, model and standard errors
ggplot(device, aes(x = Amplification, y = Voltage)) +
  geom_point() +
  geom_ribbon(data = predictions, 
              aes(ymin = Voltage - 1.96*se.fit, ymax = Voltage + 1.96*se.fit), 
              fill = "grey70", alpha = 0.5) +
  geom_line(data = predictions, color = "blue")

但是，如果我只对上的置信范围似乎变得更小： $X$ $Y$

log_model <- gam(Voltage ~ s(log(Amplification)), data = device)
# the rest of the code stays the same, except for log_model in place of model

很明显，有些可疑的事情正在发生。这些置信区间可靠吗？编辑这不仅仅是平滑程度的问题，正如答案中所建议的那样。没有对数变换，平滑参数为

> model$sp
s(Amplification) 
     5.03049e-07

使用对数变换，平滑参数确实要大得多：

>log_model$sp
s(log(Amplification)) 
         0.0005156608

但这不是置信区间变得如此小的原因。事实上，使用更大的平滑参数sp = 0.001，但避免任何对数变换，可以减少振荡（如在对数变换的情况下），但相对于对数变换的情况，标准误差仍然很大：

smooth_model <- gam(Voltage ~ s(Amplification, sp = 0.001), data = device)
# the rest of the code stays the same, except for smooth_model in place of model

一般来说，如果我记录变换和/或，置信区间会发生什么？如果在一般情况下无法定量回答，我将接受第一种情况（指数模型）的定量答案（即，它显示一个公式），并为第二种情况至少给出一个挥手的论据（GAM 模型）。 $X$ $Y$

1个回答

您会看到振铃，这是高频变化（即阶跃函数）通过低通滤波器（即 GAM）的结果。

当你应用对数变换时，你改变了图的近垂直部分的梯度，在左边，使它稍微不那么陡峭，隐含的高频更少，振铃效应消失。

编辑：这里响铃的一些图片：https ://electronics.stackexchange.com/questions/79717/what-can-reduce-overshoot-and-ringing-on-a-simple-square-wave-pulse-generator

Edit2：请注意，增加平滑会增加振铃，因为平滑本质上是导致振铃的低通滤波器。例如，可以减少振铃的方法是 1. 删除左侧上升悬崖边缘的点，然后重新安装，或 2.减少平滑，或 3. 降低频率/增加波长/增加平滑的截止频率.

你可以看到，如果你去掉悬崖边缘位，图的其余部分或多或少是一条直线，那么为什么 GAM 会通过这些点拟合正弦波呢？这完全是因为悬崖边缘部分迫使一个非常高的梯度，然后导致随后的超调。

Edit3：如果是我，我想我会尝试找到一个将图形转换为近似直线的转换。我不太确定那个变换会是什么，但看起来该图非常接近于一条平线，渐近到 ~380 左右。这是比 eg 更强的非线性log，它会变得平坦，但我认为不会那么快。也许像逆sigmoid之类的东西？Sigmoid 是：

y = \frac{1}{1 + \exp (- x)}

$y = \frac{1}{1 + \exp(-x)}$

...看起来像（来自维基百科https://en.wikipedia.org/wiki/Sigmoid_function）

Inverse sigmoid 是logit函数，https ://en.wikipedia.org/wiki/Logit ：

f (x) = \log (\frac{1}{1 - x})

$f(x) = \log \left( \frac{1}{1-x} \right)$

也许与此相关的转换或此的参数化版本可能会使图形更接近于直线，从而更适合标准统计技术？

其它你可能感兴趣的问题

上一篇卡尔曼滤波可以分层完成 - 从具有相同参数的多个时间序列估计？下一篇什么是统计学中的统计学？