Why does auto.arima not difference when xreg is supplied?

Cross Validated r arima arimax
2022-04-01 22:56:11

Simulated data:

dput(test)
structure(list(xx = c(11.68, 11.29, 11.17, 10.41, 9.36, 9.52, 
8.67, 7.69, 8.36, 6.97, 7.05, 7.08, 6.62, 6.35, 5.96, 4.91, 7.25, 
8.66, 7.85, 9, 8.14, 6.99, 7.23, 6.16, 6.42, 6.6, 5.47, 4.85, 
5.12, 4.76, 4.72, 5.32, 5.04, 5.32, 4.83, 4.83, 4.95, 5.28, 5.53, 
6.01, 6.16, 6.08, 5.81, 5.3, 4.94, 5.24, 4.04, 3.79, 4.62, 3.96, 
3.78, 3.55, 4.06, 3.17, 3.32, 3.26, 2.98, 3.74, 3.51, 3.26, 3.58, 
3, 3.37, 3.83, 4.07, 4.5, 3.88, 3.95, 3.98, 4.66, 4.25, 3.94, 
2.67, 3.35, 3.03, 1.32, 1.51, 1.89, 1.89, 1.88, 2.03, 1.75, 1.58, 
1.9, 2.13, 1.86, 1.07, 0.99, 1.32, 1.04, 1.16, 1.2, 1.1, 1.35, 
1.42, 1.2, 1.23, 1.2, 1.17, 1.06, 0.48, 0.59, 0.54, 0.5, 0.52, 
0.55, 0.51, 0.87, 0.84, 1.12, 1.64), exogenous = c(1000812, 996428, 
992312, 983312, 970940, 971216, 972260, 978320, 976604, 977624, 
984456, 988460, 992740, 1002084, 1012104, 1016452, 1032688, 1050108, 
1064876, 1070644, 1079808, 1079192, 1082396, 1086852, 1088284, 
1094408, 1101852, 1112128, 1130888, 1142100, 1156644, 1167744, 
1182440, 1185032, 1194124, 1212376, 1234436, 1246896, 1267536, 
1288632, 1307456, 1323244, 1338256, 1344260, 1345544, 1347708, 
1345300, 1353236, 1373616, 1380512, 1392532, 1401380, 1411232, 
1408736, 1408912, 1422748, 1433548, 1449596, 1464716, 1474380, 
1481728, 1491148, 1509664, 1528212, 1535496, 1536604, 1537388, 
1546744, 1558392, 1573918, 1580083, 1581735, 1581352, 1587900, 
1603940, 1583744, 1544576, 1527264, 1533664, 1549808, 1569424, 
1576764, 1586548, 1601924, 1617568, 1622520, 1642276, 1650212, 
1652392, 1657760, 1662560, 1664068, 1678948, 1688500, 1703332, 
1721832, 1722728, 1739560, 1748676, 1758956, 1755852, 1750036, 
1760456, 1760356, 1773768, 1765508, 1785276, 1799056, 1814848, 
1840508, NA)), .Names = c("xx", "exogenous"), class = c("data.table", 
"data.frame"), row.names = c(NA, -111L))

My series is clearly non-stationary, and this works fine:

auto.arima(ts(data = test$xx))

Series: ts(data = test$xx) 
 ARIMA(2,1,2) with drift 

However, when I add an exogenous variable, so that a regression with ARIMA errors is fitted, it does not difference:

auto.arima(ts(data = test$xx), xreg=test$exogenous, trace=TRUE)

Series: ts(data = test$xx) 
Regression with ARIMA(1,0,0) errors

I turned on trace and realized that it does not even consider differencing:

ARIMA(2,0,2) with non-zero mean : Inf
ARIMA(0,0,0) with non-zero mean : 341.2597324
ARIMA(1,0,0) with non-zero mean : 209.7431147
ARIMA(0,0,1) with non-zero mean : 259.8316269
ARIMA(0,0,0) with zero mean     : 586.1168037
ARIMA(2,0,0) with non-zero mean : 211.9364634
ARIMA(1,0,1) with non-zero mean : 211.9364098
ARIMA(2,0,1) with non-zero mean : Inf
ARIMA(1,0,0) with zero mean     : Inf

The exogenous variable is also non-stationary. What am I missing? For a regression with ARIMA errors, should we supply only stationary data?

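This is based on Rob Hyndman, who mentioned in a comment:

"This issue has been fixed. There is no need to do any differencing manually. Just use xreg=xr and everything should work."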

1 Answer

Inside the function from the forecast package (auto.arima) there is a test for whether the series should be differenced:

if (is.na(d)) {
    d <- ndiffs(dx, test = test, max.d = max.d)
    if (d > 0 & !is.null(xregg)) {
        diffxreg <- diff(diffxreg, differences = d, lag = 1)
        if (any(apply(diffxreg, 2, is.constant))) 
            d <- d - 1
    }
}

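where d is the order of differencing specified in the function call (NA by default). A second test, nsdiffs, is applied for seasonal differencing. If the test does not indicate a unit root, models with differencing are not even considered, which, as one might imagine, saves a lot of run time.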
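To get a feel for what these two tests return, here is a minimal sketch on made-up series. The object names (random_walk, white_noise, quarterly) and the seed are hypothetical and are not part of the question's data; the only forecast functions used are ndiffs and nsdiffs with their default tests.

```r
library(forecast)

set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical illustrative series (not the OP's data)
random_walk <- ts(cumsum(rnorm(200)))   # has a unit root by construction
white_noise <- ts(rnorm(200))           # stationary by construction
quarterly   <- ts(rep(c(10, 20, 30, 40), times = 30) + rnorm(120),
                  frequency = 4)        # strong, stable seasonal pattern

# Number of first differences suggested by the default unit-root test
d_rw <- ndiffs(random_walk, max.d = 2)
d_wn <- ndiffs(white_noise, max.d = 2)

# Number of seasonal differences suggested for the seasonal series
D_q <- nsdiffs(quarterly)

print(c(random_walk = d_rw, white_noise = d_wn, quarterly_seasonal = D_q))
```

One would typically expect a positive value for the random walk and 0 for the white noise, but since these are statistical tests on simulated data the result can vary with the seed.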

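Regarding the example in the OP: the function runs the regression lm(xx~exogenous) and applies the ARIMA modelling to the residuals. In this example the ACF/PACF plots clearly indicate that the residuals are stationary, at least to my eye.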
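The two-step logic described here (regress first, then test the residuals) can be sketched as follows. This is only an illustration on simulated data, not the OP's series: the variables x, e and y, the coefficients and the seed are all assumptions chosen so that the response is non-stationary while the regression residuals are stationary by construction.

```r
library(forecast)

set.seed(42)  # arbitrary seed
n <- 200

# Hypothetical data: x is a random walk with drift (non-stationary),
# y is a linear function of x plus stationary AR(1) noise.
x <- cumsum(rnorm(n, mean = 0.5))
e <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
y <- 2 + 0.8 * x + e

# Step 1: ordinary regression of y on x
res <- residuals(lm(y ~ x))

# Step 2: unit-root test on y itself and on the regression residuals
d_y   <- ndiffs(y, max.d = 2)
d_res <- ndiffs(res, max.d = 2)
print(c(y = d_y, residuals = d_res))

# Visual check of the residuals
acf(res)
pacf(res)

# What auto.arima chooses when the regressor is supplied
fit <- auto.arima(ts(y), xreg = cbind(x = x))
print(arimaorder(fit))  # named vector containing p, d, q
```

If the residuals test as stationary, auto.arima should report d = 0 for the error process even though y on its own would call for differencing, which is the situation in the question.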

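To see that auto.arima can in fact consider differencing the residuals, we construct the following example, in which y is clearly non-stationary and, because x is generated independently of y, the residuals of the regression of y on x will also be non-stationary (unless a very low-probability event occurs).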

> y <- rnorm(100, 1:100, 25)
> x <- rnorm(100)
> auto.arima(y, xreg=x, trace=TRUE)

 Regression with ARIMA(2,1,2) errors : Inf
 Regression with ARIMA(0,1,0) errors : 974.5948
 Regression with ARIMA(1,1,0) errors : 953.8159
... more models, removed to save space ...

 ARIMA(2,1,1)                    : 920.2894
 ARIMA(2,1,2)                    : 922.1489
 ARIMA(3,1,2)                    : 923.2468
 ARIMA(1,1,1)                    : 922.377

 Best model: Regression with ARIMA(2,1,1) errors 

Series: y 
Regression with ARIMA(2,1,1) errors 

Edit: updated in response to a comment

I copied and pasted the data from the example above and then ran:

> length(test$xx)
[1] 111
> length(test$exogenous)
[1] 111
> ndiffs(residuals(lm(xx~exogenous, data=test)), max.d=2)
[1] 0

This confirms that the ndiffs function does in fact return 0 for these data.
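Since d is an argument of the function call, the test can also be bypassed by setting d explicitly. The sketch below shows this on simulated data; the names x, y, xr, fit_auto and fit_d1 and the seed are hypothetical, and this is not a recommendation to override the test for the OP's series.

```r
library(forecast)

set.seed(7)  # arbitrary seed
n <- 150

# Hypothetical data: non-stationary regressor, stationary AR(1) errors
x  <- cumsum(rnorm(n, mean = 0.5))
y  <- 1 + 0.5 * x + as.numeric(arima.sim(model = list(ar = 0.6), n = n))
xr <- cbind(x = x)

fit_auto <- auto.arima(ts(y), xreg = xr)          # d chosen by the internal test
fit_d1   <- auto.arima(ts(y), xreg = xr, d = 1)   # d fixed by the caller

print(arimaorder(fit_auto))
print(arimaorder(fit_d1))
```

Note that information criteria such as AICc are not comparable between models with different orders of differencing, so the two fits should not be ranked by those values.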