How to forecast with outlier detection in R? - Time series analysis procedure and methods

Cross Validated: r, time-series, forecasting, arima, outliers

2022-01-30 05:40:50

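I have monthly time series data and would like to produce forecasts that take detected outliers into account.

This is a sample of my dataset: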

       Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
2006  7.55  7.63  7.62  7.50  7.47  7.53  7.55  7.47  7.65  7.72  7.78  7.81
2007  7.71  7.67  7.85  7.82  7.91  7.91  8.00  7.82  7.90  7.93  7.99  7.93
2008  8.46  8.48  9.03  9.43 11.58 12.19 12.23 11.98 12.26 12.31 12.13 11.99
2009 11.51 11.75 11.87 11.91 11.87 11.69 11.66 11.23 11.37 11.71 11.88 11.93
2010 11.99 11.84 12.33 12.55 12.58 12.67 12.57 12.35 12.30 12.67 12.71 12.63
2011 12.60 12.41 12.68 12.48 12.50 12.30 12.39 12.16 12.38 12.36 12.52 12.63

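I have followed "Time series analysis procedure and methods using R" to fit a range of forecasting models, but the results do not seem accurate. I am also unsure how to incorporate tsoutliers into this workflow.

I also have another post here with my queries about tsoutliers and the ARIMA modelling procedure.

This is my code so far, which is similar to link 1.

Code: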

library(forecast) # provides auto.arima, ets, thetaf, rwf, forecast and accuracy

product<-ts(product, start=c(1993,1),frequency=12)

#Modelling product Retail Price

#Training set
product.mod<-window(product,end=c(2012,12))
#Test set
product.test<-window(product,start=c(2013,1))
#Range of time of test set
period<-(end(product.test)[1]-start(product.test)[1])*12 + #No of month * no. of yr
(end(product.test)[2]-start(product.test)[2]+1) #No of months
#Model using different method
#arima, expo smooth, theta, random walk, structural time series
models<-list(
#arima
product.arima<-forecast(auto.arima(product.mod),h=period),
#exp smoothing
product.ets<-forecast(ets(product.mod),h=period),
#theta
product.tht<-thetaf(product.mod,h=period),
#random walk
product.rwf<-rwf(product.mod,h=period),
#Structts
product.struc<-forecast(StructTS(product.mod),h=period)
)

##Compare the training set forecast with test set
par(mfrow=c(2, 3))
for (f in models){
    plot(f)
    lines(product.test,col='red')
}

##To see its accuracy on its Test set, 
#as training set would be "accurate" in the first place
acc.test<-lapply(models, function(f){
    accuracy(f, product.test)[2,]
})
acc.test <- Reduce(rbind, acc.test)
row.names(acc.test)<-c("arima","expsmooth","theta","randomwalk","struc")
acc.test <- acc.test[order(acc.test[,'MASE']),]

##Look at the training-set accuracy to check whether the models
##overfit the training set
acc.train<-lapply(models, function(f){
    accuracy(f, product.test)[1,]
})
acc.train <- Reduce(rbind, acc.train)
row.names(acc.train)<-c("arima","expsmooth","theta","randomwalk","struc")
acc.train <- acc.train[order(acc.train[,'MASE']),]

 ##Note that we look at MAE, MAPE or MASE value. The lower the better the fit.

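This is the plot of my different forecasts. Comparing the red "test set" with the blue "forecast" set, they do not look very reliable or accurate.

[Figure: plots of the different forecasts]

Accuracy of each model on the test set and the training set: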

Test set
                    ME      RMSE       MAE        MPE     MAPE      MASE      ACF1 Theil's U
theta      -0.07408833 0.2277015 0.1881167 -0.6037191 1.460549 0.2944165 0.1956893 0.8322151
expsmooth  -0.12237967 0.2681452 0.2268248 -0.9823104 1.765287 0.3549976 0.3432275 0.9847223
randomwalk  0.11965517 0.2916008 0.2362069  0.8823040 1.807434 0.3696813 0.4529428 1.0626775
arima      -0.32556886 0.3943527 0.3255689 -2.5326397 2.532640 0.5095394 0.2076844 1.4452932
struc      -0.39735804 0.4573140 0.3973580 -3.0794740 3.079474 0.6218948 0.3841505 1.6767075

Training set
                     ME      RMSE       MAE         MPE     MAPE      MASE    ACF1 Theil's U
theta      2.934494e-02 0.2101747 0.1046614  0.30793753 1.143115 0.1638029  0.2191889194        NA
randomwalk 2.953975e-02 0.2106058 0.1050209  0.31049479 1.146559 0.1643655  0.2190857676        NA
expsmooth  1.277048e-02 0.2037005 0.1078265  0.14375355 1.176651 0.1687565 -0.0007393747        NA
arima      4.001011e-05 0.2006623 0.1079862 -0.03405395 1.192417 0.1690063 -0.0091275716        NA
struc      5.011615e-03 1.0068396 0.5520857  0.18206018 5.989414 0.8640550  0.1499843508        NA

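Judging by the accuracy measures, the most accurate model is the theta model. I am not sure why the forecasts are so inaccurate. I think one reason is that I did not treat the "outliers" in my dataset, which leads to poor forecasts for all models.

This is my outlier plot:

[Figure: outlier plot]

tsoutliers output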

ARIMA(0,1,0)(0,0,1)[12]                    

Coefficients:
        sma1    LS46    LS51    LS61    TC133   LS181   AO183   AO184   LS185   TC186    TC193    TC200
      0.1700  0.4316  0.6166  0.5793  -0.5127  0.5422  0.5138  0.9264  3.0762  0.5688  -0.4775  -0.4386
s.e.  0.0768  0.1109  0.1105  0.1106   0.1021  0.1120  0.1119  0.1567  0.1918  0.1037   0.1033   0.1040
       LS207    AO237    TC248    AO260    AO266
      0.4228  -0.3815  -0.4082  -0.4830  -0.5183
s.e.  0.1129   0.0782   0.1030   0.0801   0.0805

sigma^2 estimated as 0.01258:  log likelihood=205.91
AIC=-375.83   AICc=-373.08   BIC=-311.19

 Outliers:
    type ind    time coefhat  tstat
1    LS  46 1996:10  0.4316  3.891
2    LS  51 1997:03  0.6166  5.579
3    LS  61 1998:01  0.5793  5.236
4    TC 133 2004:01 -0.5127 -5.019
5    LS 181 2008:01  0.5422  4.841 
6    AO 183 2008:03  0.5138  4.592
7    AO 184 2008:04  0.9264  5.911
8    LS 185 2008:05  3.0762 16.038
9    TC 186 2008:06  0.5688  5.483
10   TC 193 2009:01 -0.4775 -4.624
11   TC 200 2009:08 -0.4386 -4.217
12   LS 207 2010:03  0.4228  3.746
13   AO 237 2012:09 -0.3815 -4.877
14   TC 248 2013:08 -0.4082 -3.965
15   AO 260 2014:08 -0.4830 -6.027
16   AO 266 2015:02 -0.5183 -6.442

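I would like to know how to further analyse and forecast my data using this dataset and the outlier detection results. Please help me with the treatment of my outliers and with the forecasting.

Lastly, I would like to know how to combine the forecasts of the different models, since, as @forecaster mentioned in link 1, combining different models will most likely result in better forecasts.

EDITED

I think it would be good to incorporate the outliers into the other models as well.

I tried some code, e.g.: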

forecast.ets( res$fit ,h=period,xreg=newxreg)
Error in if (object$components[1] == "A" & is.element(object$components[2], : argument is of length zero

forecast.StructTS(res$fit,h=period,xreg=newxreg)
Error in predict.Arima(object, n.ahead = h) : 'xreg' and 'newxreg' have different numbers of columns

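These calls produce errors, and I am not sure of the correct code for including the outliers as regressors. Also, how do I do this with thetaf or rwf, since there is no forecast.theta or forecast.rwf?

2 Answers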

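This answer also relates to points 6 and 7 of your other question.

Outliers are understood as observations that the model does not explain, so their role in forecasting is limited: the presence of new outliers will not be predicted. All you need to do is include the detected outliers in the forecast equation.

In the case of an additive outlier (which affects a single observation), the variable containing this outlier is simply filled with zeros over the forecast period, since the outlier was detected at an observation within the sample. In the case of a level shift (a permanent change in the data), the variable is filled with ones so that the shift is kept in the forecasts.

Next, I show how to obtain forecasts in R for an ARIMA model with the outliers detected by "tsoutliers". The key is to correctly define the argument newxreg that is passed to predict.

(This only illustrates the answer to your question about how to treat outliers when forecasting. I do not address whether the resulting model or forecasts are the best solution.)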
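As a rough illustration of the filling rule above, the following base R sketch builds the regressor columns by hand. The helper name `make_outlier_xreg`, the sample length, the outlier positions and the horizon are all hypothetical values chosen for illustration. In practice `outliers.effects` from tsoutliers builds these columns.

```r
# Build regressor columns for outliers over the sample plus the forecast horizon.
# ao_at / ls_at are observation indices inside the sample (hypothetical values below).
make_outlier_xreg <- function(n, h, ao_at = integer(0), ls_at = integer(0)) {
  total <- n + h
  cols <- list()
  for (t0 in ao_at) {
    # additive outlier: 1 at the affected observation only, 0 elsewhere
    cols[[paste0("AO", t0)]] <- as.numeric(seq_len(total) == t0)
  }
  for (t0 in ls_at) {
    # level shift: 0 before the shift, 1 from the shift onwards (future included)
    cols[[paste0("LS", t0)]] <- as.numeric(seq_len(total) >= t0)
  }
  do.call(cbind, cols)
}

n <- 24  # assumed sample length
h <- 6   # assumed forecast horizon
xreg_all <- make_outlier_xreg(n, h, ao_at = 10, ls_at = 15)
xreg_in  <- xreg_all[seq_len(n), , drop = FALSE]   # rows used when fitting
xreg_out <- xreg_all[-seq_len(n), , drop = FALSE]  # rows passed as newxreg

colSums(xreg_out)  # AO10 sums to 0 (all zeros); LS15 sums to h (all ones)
```

The `drop = FALSE` keeps the result a matrix even when only one outlier is present, so the column count seen at prediction time matches the one used at fitting time.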

require(tsoutliers)
x <- c(
  7.55,  7.63,  7.62,  7.50,  7.47,  7.53,  7.55,  7.47,  7.65,  7.72,  7.78,  7.81,
  7.71,  7.67,  7.85,  7.82,  7.91,  7.91,  8.00,  7.82,  7.90,  7.93,  7.99,  7.93,
  8.46,  8.48,  9.03,  9.43, 11.58, 12.19, 12.23, 11.98, 12.26, 12.31, 12.13, 11.99,
 11.51, 11.75, 11.87, 11.91, 11.87, 11.69, 11.66, 11.23, 11.37, 11.71, 11.88, 11.93,
 11.99, 11.84, 12.33, 12.55, 12.58, 12.67, 12.57, 12.35, 12.30, 12.67, 12.71, 12.63,
 12.60, 12.41, 12.68, 12.48, 12.50, 12.30, 12.39, 12.16, 12.38, 12.36, 12.52, 12.63)
x <- ts(x, frequency=12, start=c(2006,1))
res <- tso(x, types=c("AO","LS","TC"))

# define the variables containing the outliers for
# the observations outside the sample
npred <- 12 # number of periods ahead to forecast 
newxreg <- outliers.effects(res$outliers, length(x) + npred)
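# keep only the out-of-sample rows; drop = FALSE preserves the matrix shape
newxreg <- ts(newxreg[-seq_along(x), , drop = FALSE], start = c(2012, 1), frequency = 12)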

# obtain the forecasts
p <- predict(res$fit, n.ahead=npred, newxreg=newxreg)

# display forecasts
plot(cbind(x, p$pred), plot.type = "single", ylab = "", type = "n", ylim=c(7,13))
lines(x)
lines(p$pred, type = "l", col = "blue")
lines(p$pred + 1.96 * p$se, type = "l", col = "red", lty = 2)  
lines(p$pred - 1.96 * p$se, type = "l", col = "red", lty = 2)  
legend("topleft", legend = c("observed data", 
  "forecasts", "95% confidence bands"), lty = c(1,1,2), 
  col = c("black", "blue", "red"), bty = "n")
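The error quoted in the question ('xreg' and 'newxreg' have different numbers of columns) arises when the future regressors do not match those used at fitting time. A minimal base R sketch of the matching requirement follows, using stats::arima directly. The simulated series, the shift position and size, and the AR(1) order are assumptions made purely for illustration.

```r
set.seed(1)
n <- 60      # assumed sample length
h <- 12      # assumed forecast horizon
ls_at <- 31  # hypothetical level-shift position

# Simulated series: AR(1) noise around level 10, shifting up by 3 at ls_at
ls_in <- as.numeric(seq_len(n) >= ls_at)
noise <- as.numeric(arima.sim(list(ar = 0.5), n = n, sd = 0.2))
y <- ts(10 + 3 * ls_in + noise, frequency = 12)

# Fit with the level-shift dummy as an external regressor
xreg_in <- cbind(LS31 = ls_in)
fit <- arima(y, order = c(1, 0, 0), xreg = xreg_in)

# Future rows: same columns in the same order, shift kept on for every period
xreg_out <- cbind(LS31 = rep(1, h))
stopifnot(ncol(xreg_out) == ncol(xreg_in))  # a mismatch here makes predict() fail

p <- predict(fit, n.ahead = h, newxreg = xreg_out)
range(p$pred)  # forecasts stay near the post-shift level
```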

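Edit

The function predict used above returns forecasts based on the chosen ARIMA model, an ARIMA(2,0,0) stored in res$fit, and on the detected outliers, res$outliers. The model equation has this form: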

$y_t = \sum_{j=1}^m \omega_j L_j(B) I_t(t_j) + \frac{\theta(B)}{\phi(B) \alpha(B)} \epsilon_t \,, \quad \epsilon_t \sim NID(0, \sigma^2) \,,$

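where $L_j$ is a polynomial related to the $j$-th outlier (see the tsoutliers documentation or the original paper by Chen and Liu cited in my answer to your other question); $I_t$ is an indicator variable; and the last term consists of the polynomials that define the ARMA model.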
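For reference, in the Chen and Liu framework the polynomials for the three outlier types requested in the call to tso above (AO, LS, TC) are usually written as shown below. The range given for $\delta$ is the standard one; as far as I recall, tsoutliers uses $\delta = 0.7$ by default, which should be checked against the package documentation.

$\text{AO: } L_j(B) = 1 \,, \qquad \text{LS: } L_j(B) = \frac{1}{1 - B} \,, \qquad \text{TC: } L_j(B) = \frac{1}{1 - \delta B} \,, \quad 0 < \delta < 1 \,.$

Applied to the indicator $I_t(t_j)$, these give a single spike (AO), a permanent step (LS) and a spike that decays geometrically at rate $\delta$ (TC). This is why the AO regressor is zero over the forecast period while the LS regressor stays at one.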

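Using software that I helped to develop, a reasonable model for your 72 observations would include a power transformation (logs), because the error variance is linked to the expected value. This is also fairly evident from the original plot, where the eye can detect increased variance at the higher levels.

[Figures: actual/fit/forecast plot and plot of the final residuals]

Note the more realistic confidence limits, which take the power transformation into account. Although this response does not use R, it does raise the bar as to what a reasonable model built in R might include.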
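A rough base R counterpart of the log-transformation idea is sketched below. It is not the software used in this answer, and the ARIMA(0,1,1) order is an assumption chosen only to keep the example short. It fits the model on the log scale and back-transforms the forecasts and their approximate 95% limits.

```r
# Monthly series from the question (2006-2011)
x <- ts(c(
   7.55,  7.63,  7.62,  7.50,  7.47,  7.53,  7.55,  7.47,  7.65,  7.72,  7.78,  7.81,
   7.71,  7.67,  7.85,  7.82,  7.91,  7.91,  8.00,  7.82,  7.90,  7.93,  7.99,  7.93,
   8.46,  8.48,  9.03,  9.43, 11.58, 12.19, 12.23, 11.98, 12.26, 12.31, 12.13, 11.99,
  11.51, 11.75, 11.87, 11.91, 11.87, 11.69, 11.66, 11.23, 11.37, 11.71, 11.88, 11.93,
  11.99, 11.84, 12.33, 12.55, 12.58, 12.67, 12.57, 12.35, 12.30, 12.67, 12.71, 12.63,
  12.60, 12.41, 12.68, 12.48, 12.50, 12.30, 12.39, 12.16, 12.38, 12.36, 12.52, 12.63),
  frequency = 12, start = c(2006, 1))

# Fit on the log scale; the ARIMA(0,1,1) order is an illustrative assumption
fit <- arima(log(x), order = c(0, 1, 1))
p   <- predict(fit, n.ahead = 12)

# Back-transform: exp() of the log-scale forecast is a median forecast
# on the original scale, and the limits become asymmetric around it
fc    <- exp(p$pred)
lower <- exp(p$pred - 1.96 * p$se)
upper <- exp(p$pred + 1.96 * p$se)

round(cbind(forecast = fc, lower = lower, upper = upper), 2)
```

Because the limits are computed on the log scale, they can never fall below zero, and the upper limit lies further from the forecast than the lower one.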
