机器算法验证 - 解释负二项式时间序列 - 吾爱随笔录

我正在为其他人处理时间序列数据，这些数据计算了 48 个月内发生关闭的紧急部门相关事件，并希望调查这些关闭的影响。测量是事件的计数或每个时间点的比例。

最初他们要求我使用 Prais-Winsten 时间序列回归，它解释了重复测量之间的自相关。他们现在要求我对计数结果执行负二项式时间序列，以检查建模方法之间的一致性（我最初提出了 Prais-Winsten 的替代方案，例如 GARCH 或 ARIMA 模型，但他们不感兴趣）。

我现在已经执行了要求的分析，但我正在努力协调这两种方法的结果，非常感谢您的建议。

数据

> dput(sample)
structure(list(relative.month = 1:48, value = c(2407, 2019, 2623,
 2443, 2043, 2605, 2623, 2365, 2822, 2558, 2144, 2601, 3034, 2878,
 2883, 1995, 2323, 2861, 2193, 2729, 2859, 3118, 2064, 2350, 1701,
 1665, 1848, 1428, 1371, 2043, 1218, 790, 1502, 1591, 1661, 2074,
 713, 1531, 1460, 1444, 1391, 1632, 1623, 934, 1718, 1217, 1439,
 1377), step = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), town = c(1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1)), .Names = c("relative.month", "value", "step", "town"), row.names =     c(NA, -48L), class = "data.frame")

## Pre and post Means
> group_by(sample, step) %>%
  summarise(mean = mean(value, na.rm = TRUE),
            sd   = sd(value, na.rm = TRUE))
  step     mean       sd
   <dbl>    <dbl>    <dbl>
 1     0 2522.500 336.5390
 2     1 1473.792 334.4394

## Difference in means Before (0) and After (1)
> 2522.500 - 1473.792
 [1] 1048.708
## Percentage difference in means
> 100* (2522.500 - 1473.792) / 2522.500
 [1] 41.57415

快速绘制数据图......

普拉斯-温斯滕

一步运行一个非常简单的 Prais-Winsten 回归并考虑自相关......

## Simple Prais-Winsten Regression with 1-step
model <- reformulate(response   = 'value',
                     termlabels = c('step'))
pwr.1step <- panelAR(formula         = model,
                     data            = sample,
                     panelVar        = 'town',
                     timeVar         = 'relative.month',
                     autoCorr        = 'ar1',
                     panelCorrMethod = 'pcse',
                     rho.na.rm       = TRUE,
                     seq.times       = FALSE)
pwr.1step %>% summary()

Panel Regression with AR(1) Prais-Winsten correction and panel-corrected standard errors

 Balanced Panel Design:
  Total obs.:       48 Avg obs. per panel 48
  Number of panels: 1  Max obs. per panel 48
  Number of times:  48 Min obs. per panel 48

 Coefficients:
             Estimate Std. Error t value Pr(>|t|)
 (Intercept)  2524.55      59.99   42.09  < 2e-16 ***
 step        -1051.90      85.00  -12.38 3.05e-16 ***
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 R-squared: 0.7598
 Wald statistic: 153.1524, Pr(>Chisq(1)): 0

我对此的解释是，该步骤导致观察到的事件数量平均减少，大约为 908 个事件（/平均值），这与 (0) 之前和 (1) 之后的均值差异相差不远即 904（或减少 37.56%）。

包括（相对）时间作为协变量来估计随时间的变化......

## Simple Prais-Winsten Regression with 1-step and time as co-variate
model <- reformulate(response   = 'value',
                     termlabels = c('closure', 'relative.month'))
pwr.1step.time <- panelAR(formula         = model,
                          data            = sample,
                          panelVar        = 'town',
                          timeVar         = 'relative.month',
                          autoCorr        = 'ar1',
                          panelCorrMethod = 'pcse',
                          rho.na.rm       = TRUE,
                          seq.times       = FALSE)
pwr.1step.time %>% summary()
Panel Regression with AR(1) Prais-Winsten correction and panel-corrected standard errors

 Balanced Panel Design:
  Total obs.:       48 Avg obs. per panel 48
  Number of panels: 1  Max obs. per panel 48
  Number of times:  48 Min obs. per panel 48

 Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
 (Intercept)     2526.3522    97.9934  25.781  < 2e-16 ***
 step           -1048.3859   172.1547  -6.090  2.3e-07 ***
 relative.month    -0.1453     6.2160  -0.023    0.981
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 R-squared: 0.7596
 Wald statistic: 152.9394, Pr(>Chisq(2)): 0

与更简单的模型一样，该步骤之后的事件数量仍然减少，但包含时间变量表明，在整个时间序列中（即在 (0) 之前和 (1) 之后），往往会减少随着时间的推移事件的数量，虽然这在步骤之后更加明显。

负二项式时间序列

只需一步即可在非常简单的模型上运行负二项式时间序列，但允许连续观察之间的相关性......

## Negative Binomial Time-Series with 1-step
> outcome <- sample[,'value']
> covariates <- dplyr::select(sample, step)
> negbin.1step <- tsglm(ts    = outcome,
                        link  = 'log',
                        model = list(past_obs = 1),
                        xreg  = covariates,
                        distr = 'nbinom')
> negbin.1step %>% summary()
 Call:
 tsglm(ts = outcome, model = list(past_obs = 1), xreg = covariates,
     link = "log", distr = "nbinom")

 Coefficients:
              Estimate  Std.Error  CI(lower)  CI(upper)
 (Intercept)    8.3258     1.0195      6.328     10.324
 beta_1        -0.0629     0.1301     -0.318      0.192
 step          -0.5715     0.0888     -0.746     -0.397
 sigmasq        0.0349         NA         NA         NA
 Standard errors and confidence intervals (level =  95 %) obtained
 by normal approximation.

 Link function: log
 Distribution family: nbinom (with overdispersion coefficient 'sigmasq')
 Number of coefficients: 4
 Log-likelihood: -351.8314
 AIC: 711.6627
 BIC: 719.1476
 QIC: 3760.517
## Exponentiate the coefficients to obtain a percentage change
> negbin.1step %>% coefficients() %>% exp()
 (Intercept)       beta_1         step
 4128.9061822    0.9389960    0.5646913

通读tscount 小插图后，我认为这意味着在该步骤之后每个月观察到的事件数量（大约）减少了 44%，并且事件数量减少了小幅（~7%）随着时间的推移（beta_1系数）。

添加时间作为协变量以匹配 Prais-Winsten 回归给出...

## Negative Binomial Time-Series with 1-step and time
> outcome <- sample[,'value']
> covariates <- dplyr::select(sample, step, relative.month)
> negbin.1step.time <- tsglm(ts    = outcome,
                             link  = 'log',
                             model = list(past_obs = 1),
                             xreg  = covariates,
                             distr = 'nbinom')
> negbin.1step.time %>% summary()
 Call:
 tsglm(ts = outcome, model = list(past_obs = 1), xreg = covariates,
 link = "log", distr = "nbinom")

 Coefficients:
                  Estimate  Std.Error  CI(lower)  CI(upper)
 (Intercept)      8.335323    1.04206    6.29292   10.37773
 beta_1          -0.063606    0.13211   -0.32255    0.19533
 step            -0.563206    0.12753   -0.81316   -0.31325
 relative.month  -0.000357    0.00412   -0.00843    0.00771
 sigmasq          0.035660         NA         NA         NA
 Standard errors and confidence intervals (level =  95 %) obtained
 by normal approximation.

 Link function: log
 Distribution family: nbinom (with overdispersion coefficient 'sigmasq')
 Number of coefficients: 5
 Log-likelihood: -351.7691
 AIC: 713.5383
 BIC: 722.8943
 QIC: 3923.615

> negbin.1step.time %>% coefficients() %>% exp()
  (Intercept)         beta_1           step relative.month
  4168.5475445      0.9383745      0.5693807      0.9996430

在这一步有大致相同的减少，并且考虑到随着时间的推移轻微减少的相关性 (the beta_1)，relative.month协变量几乎没有随时间变化。

问题

等效性 - 结果告诉我大致相同的事情，即由步骤引起的事件数量减少，并且随着时间的推移，事件数量减少。
beta_1考虑到它估计连续观察对之间的相关性，将负二项式中的分量解释为大致相当于相对月份的影响是否合理？如果是这样，是否不需要relative.month在负二项式时间序列中包含协变量？
当观察到的事件数以千计时，使用负二项式时间序列模型是否合理？有人建议这样做可能不合适，因为负二项式分布适用于过度分散（即由于可能的偏斜而导致高方差）的计数。
从上一个问题开始，哪种方法最适合报告？