机器算法验证 - 如何在 R 中实现模型？ - 吾爱随笔录

如何在 R 中实现模型？

机器算法验证 r 时间序列多重回归

2022-03-30 01:37:46

我希望你能帮助我在 R 中实现这个模型

在此处输入图像描述

或更明确

在此处输入图像描述

在哪里

yt = 月平均值
μi = 第 i 个月的平均值，i = 1。. . 12.
I1;t = 一年中第 i 个月的指标系列，即如果该月对应于一年中的第 i 个月，则为 1，否则为 0。
bi = 一年中第 i 个月的趋势。
Rt = t/12
Z1;t= 回归量 1，c1 是相关系数。
Z2;t= 回归量 2，c2 是相关系数。
Nt = 残余噪声系列，建模为自回归 AR(1) 系列

所以我想为每月和趋势分量获得 12 个系数，为回归量 1 和 2 获得 1 个系数。

1个回答

这是拟合您描述的模型的一种方法。

# sample series
x <- AirPassengers
# to illustrate a more general case, 
# take a subsample that does not start in the first season
# and ends in the last season
x <- window(x, start=c(1949,2), end=c(1959,4))

可以通过多种方式创建季节性截距的指标变量，例如：

# monthly intercepts
S <- frequency(x)
monthly.means <- do.call("rbind", 
  replicate(ceiling(length(x)/S), diag(S), simplify = FALSE))

非平方时间序列的一些安排（如果序列在第一季开始并在最后一季结束，这将无效）：

monthly.means <- ts(monthly.means, frequency = S, start = c(start(x)[1], 1))
monthly.means <- window(monthly.means, start = start(x), end = end(x))

不管第一个观测属于哪个月份，第一列与一月相关，第二列与二月相关，依此类推。

# column names
if (S == 12) {
  colnames(monthly.means) <- month.abb
} else
  colnames(monthly.means) <- paste("season", 1L:S)

对象的第一行monthly.means：

monthly.means[1:15,]
#       Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
#  [1,]   0   1   0   0   0   0   0   0   0   0   0   0
#  [2,]   0   0   1   0   0   0   0   0   0   0   0   0
#  [3,]   0   0   0   1   0   0   0   0   0   0   0   0
#  [4,]   0   0   0   0   1   0   0   0   0   0   0   0
#  [5,]   0   0   0   0   0   1   0   0   0   0   0   0
#  [6,]   0   0   0   0   0   0   1   0   0   0   0   0
#  [7,]   0   0   0   0   0   0   0   1   0   0   0   0
#  [8,]   0   0   0   0   0   0   0   0   1   0   0   0
#  [9,]   0   0   0   0   0   0   0   0   0   1   0   0
# [10,]   0   0   0   0   0   0   0   0   0   0   1   0
# [11,]   0   0   0   0   0   0   0   0   0   0   0   1
# [12,]   1   0   0   0   0   0   0   0   0   0   0   0
# [13,]   0   1   0   0   0   0   0   0   0   0   0   0
# [14,]   0   0   1   0   0   0   0   0   0   0   0   0
# [15,]   0   0   0   1   0   0   0   0   0   0   0   0

对于每月趋势，我们可以重用“monthly.means”：

monthly.trends <- monthly.means * seq_along(x) / S
round(monthly.trends[1:15,], 2)
#       Jan  Feb  Mar  Apr  May  Jun Jul  Aug  Sep  Oct  Nov  Dec
#  [1,]   0 0.08 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00
#  [2,]   0 0.00 0.17 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00
#  [3,]   0 0.00 0.00 0.25 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00
#  [4,]   0 0.00 0.00 0.00 0.33 0.00 0.0 0.00 0.00 0.00 0.00 0.00
#  [5,]   0 0.00 0.00 0.00 0.00 0.42 0.0 0.00 0.00 0.00 0.00 0.00
#  [6,]   0 0.00 0.00 0.00 0.00 0.00 0.5 0.00 0.00 0.00 0.00 0.00
#  [7,]   0 0.00 0.00 0.00 0.00 0.00 0.0 0.58 0.00 0.00 0.00 0.00
#  [8,]   0 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.67 0.00 0.00 0.00
#  [9,]   0 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.75 0.00 0.00
# [10,]   0 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.83 0.00
# [11,]   0 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.92
# [12,]   1 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00
# [13,]   0 1.08 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00
# [14,]   0 0.00 1.17 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00
# [15,]   0 0.00 0.00 1.25 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00

一些任意的回归器：

set.seed(123)
xreg1 <- runif(length(x), 100, 200)
xreg2 <- rnorm(length(x), mean = mean(x))

已编辑

正如我在第一次编辑中所做的那样，现在可以使用函数来拟合模型lm。

lm(x ~ 0 + monthly.means + monthly.trends + xreg1 + xreg2)

但是我忽略了问题中提到的错误术语的 AR(1) 结构。正如@RobHyndman 在下面的评论中所说，我们可以使用arima为错误项指定 AR(1) 过程的函数order = c(1,0,0)。不应包含截距，以避免与季节性平均值的虚拟变量发生精确的多重共线性。（示例是任意的，打印输出只是为了显示估计的系数。）

xreg <- cbind(monthly.means, monthly.trends, xreg1, xreg2)
fit <- arima(x, order = c(1,0,0), xreg = xreg, include.mean = FALSE)
round(coef(fit), 2)
#                ar1  monthly.means.Jan  monthly.means.Feb  monthly.means.Mar 
#               0.84              69.87              82.45              93.10 
#  monthly.means.Apr  monthly.means.May  monthly.means.Jun  monthly.means.Jul 
#              85.17              76.39              77.48              83.31 
#  monthly.means.Aug  monthly.means.Sep  monthly.means.Oct  monthly.means.Nov 
#              81.01              81.23              68.06              57.34 
#  monthly.means.Dec monthly.trends.Jan monthly.trends.Feb monthly.trends.Mar 
#              73.20              27.07              23.36              27.84 
# monthly.trends.Apr monthly.trends.May monthly.trends.Jun monthly.trends.Jul 
#              27.56              29.34              35.98              40.47 
# monthly.trends.Aug monthly.trends.Sep monthly.trends.Oct monthly.trends.Nov 
#              40.30              32.07              27.85              24.02 
# monthly.trends.Dec              xreg1              xreg2 
#              25.60               0.00               0.08

我们可以做一些简单的检查。在仅包含季节性截距的模型中，系数的估计值与样本均值的值相匹配。

fit2 <- arima(x, order = c(0,0,0), xreg = xreg[,1:12], include.mean = FALSE)
coef.seasonal.intercepts <- coef(fit2)
sample.means <- lapply(split(x, cycle(x)), mean)
all.equal(coef.seasonal.intercepts,
  as.vector(unlist(sample.means)), check.attributes = FALSE)
# TRUE

已编辑（回答 OP 发表的评论。）

如果扰动项遵循 AR(1) 过程，则普通最小二乘估计量无偏但效率不高，即平均而言，它给出了真实值，但参数估计的标准误差高于独立误差的经典设置（在换句话说，估计是无效的）。

正如您所说，使用 AR(1) 误差项扩展模型arima不会对估计有太大影响（标准 OLS 仍然是无偏的），但由于效率的提高，它们的标准误差会更小。

当干扰项是自相关的时，省略 AR 项将导致估计的标准误差更大，这意味着 t 统计量中的分母更大，因此 t 统计量也更大。因此，对非显着回归变量的检验将偏向于非拒绝。使用arima指定 AR 错误术语将防止出现此问题。

下面的代码是检查这些想法的一个小练习：10,000 个序列是从一个具有截距和 AR(1) 错误的模型生成的。如果xregcoef设置为非零值，则将外部变量添加到数据中。包含截距和外部回归量的模型通过arimawith order=c(1,0,0)(stored in fit1) 和( lmstored in ) 进行拟合fit2。

set.seed(123)
xreg <- runif(200, 2, 6)
xregcoef <- 0
res <- matrix(nrow = 10000, ncol = 6)
colnames(res) <- c("coef 1", "s.e. 1", "t-stat. 1", "coef 2", "s.e. 2", "t-stat. 2")
for (i in seq.int(nrow(res)))
{
  x <- 2 + arima.sim(n=200, model=list(ar=0.7)) + xregcoef * xreg    
  fit1 <- arima(x, order=c(1,0,0), xreg=xreg, include.mean=TRUE)
  res[i,1] <- coef(fit1)["xreg"]
  res[i,2] <- sqrt(fit1$var.coef["xreg","xreg"])
  res[i,3] <- res[i,1]/res[i,2]    
  fit2 <- summary(lm(x ~ 1 + xreg))
  res[i,4:6] <- coef(fit2)["xreg", c("Estimate", "Std. Error", "t value")]
}

例如，我们可以看到两个回归的估计值都非常接近真实值（无偏估计），但当 AR 结构被忽略时，xregcoef=3标准误略高。lm

# results for xregcoef=3
t(apply(res, 2, summary))
#               Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
# coef 1     2.81600  2.96300  3.00100  3.00000  3.03800  3.20100
# s.e. 1     0.04313  0.05286  0.05491  0.05492  0.05693  0.06677
# coef 2     2.65800  2.93300  3.00100  2.99900  3.06400  3.37000
# s.e. 2     0.05928  0.08393  0.08879  0.08905  0.09400  0.12280

现在让我们考虑xregcoef=0，也就是说，回归器不是数据生成过程的一部分，但我们拟合了一个包含该回归器的模型。平均而言，在这两种情况下，系数估计为零，但是，由于在第二种情况下标准误差略高，因此在下面选择 5% 的显着性水平的情况下，t 检验的空值被拒绝的频率略高于应有的水平.

# results for xregcoef=0
t(apply(res, 2, summary))
#               Min.  1st Qu.    Median       Mean 3rd Qu.    Max.
# coef 1    -0.18440 -0.03698 0.0007062  0.0003874 0.03763 0.20080
# s.e. 1     0.04313  0.05286 0.0549100  0.0549200 0.05693 0.06677
# t-stat. 1 -3.68100 -0.67410 0.0129200  0.0072410 0.68500 3.53400
# coef 2    -0.34200 -0.06694 0.0012510 -0.0006191 0.06422 0.36970
# s.e. 2     0.05928  0.08393 0.0887900  0.0890500 0.09400 0.12280
# t-stat. 2 -3.96600 -0.75560 0.0138800 -0.0070140 0.72210 4.44500
#
# rejections of the null xregcoef=0 at the 5% significance level 
sum(abs(res[,3]) > 1.96) / nrow(res)
# [1] 0.0516
sum(abs(res[,6]) > 1.96) / nrow(res)
# [1] 0.0749

其它你可能感兴趣的问题

上一篇朴素贝叶斯中的平衡数据集下一篇具有 R 函数的高尔距离；“gower.dist”和“雏菊”