机器算法验证 - 为 GAM 构建协方差矩阵 - 吾爱随笔录

为 GAM 构建协方差矩阵

机器算法验证 r 广义加法模型

2022-03-25 02:30:11

我在 R 中使用 GAM 来比较来自 3 个国家/地区的时间序列数据。数据集是一年的每小时测量值。这里的主要目的是显示在一天中的哪个时间和一年中的哪一天，数据与平均序列非常匹配。下面的脚本显示了我的尝试：

## Ozone measurements for three countries in Europe
## Find similarities between time series
require(plyr)
require(lattice)
require(mgcv)

TopFolder <- list("http://www.nilu.no/projects/ccc/onlinedata/ozone/CZ03_2009.dat"
,"http://www.nilu.no/projects/ccc/onlinedata/ozone/CY02_2009.dat"
,"http://www.nilu.no/projects/ccc/onlinedata/ozone/BE35_2009.dat"
)

## create variable for the data 
data = ldply(TopFolder, header = TRUE, read.table, sep = "", skip = 3)

## define Ozone levels
Ozone <- data$Value
Ozone[Ozone==-999] <- NA
Ozone <- data.frame(Ozone)

## define Datetime -  need to concatenate arrays
DateTime <- paste(data$Date,data$Hour, sep = " ")
Date <- as.POSIXct(DateTime, format = "%d.%m.%Y %H:%M")

## define countries 
Countries <- c("Czech","Cyprus","Belgium")
Country <- data.frame(Country = rep(Countries, each = 8760))

## bind together
Dat <- cbind(Ozone, Country = Country)
Dat <- transform(Dat, Doy = as.numeric(format(Date,format = "%j")),
                     Tod = as.numeric(format(Date,format = "%H")),
               DecTime = rep(seq(1,365, length = 8760),by = 3))

## plot the Ozone data
xyplot(Ozone~DecTime | Country, data = Dat, type  = "l", col = 1,
       strip = function(bg = 'white',...)strip.default(bg = 'white',...))

## generalised additive model
mod1 <- gam(Ozone ~ Country + s(Doy,bs = "cc", k = 20) + s(Doy, by = Country, bs = "cc", k = 20) + 
  s(Tod,bs = "cc", k=7) + s(Tod,by = Country,bs = "cr", k=7),
             data = Dat, method = "ML")

plot(mod1,pages=1,scale=0,shade=TRUE)

xyplot(resid(mod1) ~ Doy | Country, data = Dat, type = c("l","smooth"))

mod2 <- gamm(Ozone ~ Country + s(Doy,bs = "cc", k = 20) + s(Doy, by = Country, bs = "cc", k = 20) + 
  s(Tod,bs = "cc", k=7) + s(Tod,by = Country,bs = "cr", k=7),
             data = Dat, method = "ML",correlation = corAR1(form = ~ DecTime | Country))

一个问题是我使用了一个具有相关误差的加法模型，即在每小时测量的每个时间序列中，高浓度之后是另一个高浓度。这可以通过在 GAM 中添加相关矩阵来解决：

但是，R 会抛出一个错误，因为它无法从操作系统获取更多 RAM。因此，为了解决这个问题，我需要： (1) 从 mod1 - 将时间序列模型 (mod3) 拟合到残差 (2) 从 mod3 获得 acf(1) (2) 使用 acf(1) 构建协方差然后我们可以将其传递给 mod2 矩阵，从而简化相关矩阵。

任何有关如何完成上述三个步骤中的部分或全部步骤的建议将不胜感激。虽然我认为我不确定的第一点是如何在 mod1 中提取不同国家的残差？

1个回答

巧合的是，我一直在思考这个问题一段时间，并在前几天才通过电子邮件向 Simon Wood 发送电子邮件。

他对我的建议是（除其他事项外，但我认为当模型中有术语时使用bam()'rho参数不适用by）检查 . 末尾的示例?magic。我在下面用输出重现该示例：

## Now a correlated data example ... 
library(nlme)
## simulate truth
set.seed(1)
n <- 400
sig <- 2
x <- 0:(n-1)/(n-1)
f <- 0.2 * x^11 * (10 * (1-x))^6 + 10 * (10*x)^3 * (1-x)^10
## produce scaled covariance matrix for AR1 errors...
V <- corMatrix(Initialize(corAR1(.6),data.frame(x=x)))
## note here that V is what you would estimate from the model residuals
## but here we are generating the data from known correlation.
## You would plug your estimate of the correlation parameter in to corAR1(X)
##
## for the Cholesky factorisation of V
Cv <- chol(V)  # t(Cv)%*%Cv=V
## Simulate AR1 errors ...
e <- t(Cv) %*% rnorm(n, 0, sig) # so cov(e) = V * sig^2
## Observe truth + AR1 errors
y <- f + e 

## GAM ignoring correlation
b <- gam(y ~ s(x, k = 20))

## Fit smooth, taking account of *known* correlation...
## form a weight matrix
w <- solve(t(Cv)) # V^{-1} = w'w
## Use `gam' to set up model for fitting...
G <- gam(y ~ s(x, k = 20), fit=FALSE)
## fit using magic, with weight *matrix*
mgfit <- magic(G$y, G$X, G$sp, G$S, G$off, rank=G$rank, C=G$C, w=w)
## Modify previous gam object using new fit, for plotting...    
mg.stuff <- magic.post.proc(G$X, mgfit, w)
b2 <- b ## copy b
b2$edf <- mg.stuff$edf
b2$Vp <- mg.stuff$Vb
b2$coefficients <- mgfit$b 


## compare fits
layout(matrix(1:2, ncol = 2))
plot(b, main = "Ignoring correlation")
lines(x, f-mean(f), col=2)
plot(b2, main = "Known correlation")
lines(x, f-mean(f), col=2)
layout(1)

两次拟合的结果图如下所示：

在此处输入图像描述

你还有一个额外的问题是你的 $V$ 不会像这里那样是一个简单的矩阵。因为你适合Country你的 $V$ 将是一个块对角线（我认为这是正确的术语）矩阵，首先生成每个 $V_{country}$ 然后以对角线方式堆叠它们，使得单个观察值Country彼此相关（因此它们在矩阵中的条目不为零，即 $\rho^{|s|}$ 在哪里 $\rho$ 是估计的 AR(1) 参数，Country是s两个残差在时间点上的分离），但与其他国家的观察结果不相关（这些条目在 $V$ 为零）。至少那是gamm()您显示的代码会产生 IIRC 的内容。

虽然我没有尝试过，但是在 R 中有几个选项可以形成块对角矩阵，你可以使用几年前发布到 R-Help 的函数来粘贴你的个人 $V_{country}$ 一起。其他选项包括转换每个 $V$ 转换为Matrix包可以理解的稀疏矩阵，使用其bdiag()函数，然后使用as.matrix(). 有关详细信息，请参阅Matrix包的帮助。

至于获得每个国家残差的 AR(1) 项，请通过拟合您的模型，gam()但为了帮助，在拟合之前先将所有数据按时间顺序放置在国家/地区内。如果您NA在数据中有任何内容，请确保添加na.action = na.exclude对的调用，gam()因为这将NA放回您resid()从模型中提取的残差中。您可以像这样按国家/地区拆分残差向量：

res <- resid(mod)
res.spl <- split(res, Dat$Country)

然后，您可以简单地sapply()使用函数acf()或pacf()每个组件res.spl来提取每个国家/地区的滞后 1 系数。这些是要插入corAR1()上述示例代码的值。

令人惊讶的是，您的示例代码与我的博士生（尽管他们正在研究湖泊）前几天发给我的代码非常相似。你知道有人在研究高分辨率湖泊数据吗？如果不是，也许他们在这里看到了您的帖子？

其它你可能感兴趣的问题

上一篇无复发生存的传统定义是什么？下一篇在遗传算法中使用交叉验证时如何避免过度拟合