Measurement error of maximum counts

Cross Validated: error, extreme-value

2022-04-04 02:39:35

I am familiar with the idea of a mean and of variation around the mean. Is it possible to quantify the variation around a maximum?

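For example, take the following data, collected over 10 years. I would like to present the maximum for each month, but I would also like to quantify how much each month's maximum varies across the 10 years: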

counts <- data.frame(year = rep(2000:2009, each = 12), month = rep(month.abb, 10), count = sample(1:500, 120, replace = TRUE))

The first 20 rows of the data:

head(counts, 20)

   year month count
1  2000   Jan    14
2  2000   Feb   182
3  2000   Mar   462
4  2000   Apr   395
5  2000   May   107
6  2000   Jun   127
7  2000   Jul   371
8  2000   Aug   158
9  2000   Sep   147
10 2000   Oct    41
11 2000   Nov   141
12 2000   Dec    27
13 2001   Jan    72
14 2001   Feb     7
15 2001   Mar    40
16 2001   Apr   351
17 2001   May   342
18 2001   Jun    81
19 2001   Jul   442
20 2001   Aug   389

Which quantity could I use: the standard deviation? The interquartile range? The range of the maxima? A confidence interval?

1 Answer

It is hard even to define what you mean by "measurement error of the maximum count".

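For the mean this is easy, because the mean is a parameter of the underlying theoretical distribution that generated the data. That parameter can be estimated together with its uncertainty.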

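The maximum, on the other hand, is generally not a parameter of the distribution; many distributions, such as the normal, have no finite maximum at all. So when you talk about a maximum, it is always the maximum of your sample.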
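A small simulation can make this concrete. The sketch below is an illustration and not part of the original answer: it assumes standard normal data and an arbitrary seed, and compares the mean and the maximum of the same sample as it grows. The mean settles near the distribution's parameter, while the maximum never decreases and tends to drift upward as more data arrive.

```r
# Illustrative sketch (assumptions: standard normal data, arbitrary seed).
set.seed(1)
x <- rnorm(10000)
sizes <- c(10, 100, 1000, 10000)

# Mean and maximum of the first n observations, for growing n
running <- data.frame(
  n    = sizes,
  mean = sapply(sizes, function(n) mean(x[1:n])),
  max  = sapply(sizes, function(n) max(x[1:n]))
)
print(running)
```

Because each larger sample contains the smaller one, the max column is non-decreasing by construction; how fast it grows depends on the tail of the distribution.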

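Working with a sample maximum rules out Bayesian statistics, because that approach treats your data as fixed. You will have to use a frequentist approach, which treats the model as fixed and your data as a sample from that model. The inference can be done either analytically or by bootstrapping. I am not very good at deriving complicated frequentist maximum-likelihood formulas, so I will just give you a bootstrap example on your data: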

library(boot)

counts <- data.frame(year = rep(2000:2009, each = 12), month = rep(month.abb, 10), count = sample(1:500, 120, replace = TRUE))

# this is how you compute the maximum
aggregate(counts$count, list(counts$month), max)

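# statistic for boot(): monthly maxima of the sub-sample given by `indices`
# note: tapply() orders the months alphabetically (Apr, Aug, Dec, ...),
# so t1* below is Apr, not Jan
month_max <- function (data, indices) {
    d <- data[indices,] # allows boot to select sample
    return (tapply(d$count, d$month, max))
}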

# bootstrapping with 1000 replications
results <- boot(data=counts, statistic=month_max, R=1000)
results
# ORDINARY NONPARAMETRIC BOOTSTRAP
# [...]
# Bootstrap Statistics :
#      original  bias    std. error
# t1*       466 -28.364    48.41140
# t2*       496 -27.725    40.78849
# t3*       455 -40.789    57.09997
# t4*       499 -32.997    47.74439
# t5*       466 -15.057    34.23477
# t6*       484 -15.966    39.79838
# t7*       491 -24.337    38.84459
# t8*       370 -24.701    39.31971
# t9*       474 -28.850    57.94352
# t10*      448 -23.793    59.52596
# t11*      446 -64.173    84.13633
# t12*      398 -22.229    36.31511

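The "original" column holds the observed maxima, and the output also gives a standard error for each of them. You can see that the bias is quite high. It is negative in every row because a bootstrap resample can never contain a value larger than the observed maximum. This suggests that the "true" maximum is not among the sampled values, which is normal for the max function, unlike for mean.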
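The source of the negative bias can be checked with a few lines of base R. This is a minimal sketch using ten hypothetical counts for a single month; the values, the seed, and the names x, observed_max, and boot_max are made up for illustration.

```r
# Illustrative sketch: hypothetical counts for one month over 10 years.
set.seed(42)
x <- sample(1:500, 10, replace = TRUE)
observed_max <- max(x)

# Maximum of each of 1000 resamples drawn with replacement from x
boot_max <- replicate(1000, max(sample(x, replace = TRUE)))

all(boot_max <= observed_max)     # TRUE by construction
mean(boot_max) - observed_max     # bootstrap bias estimate, never positive
mean(boot_max == observed_max)    # share of resamples that hit the observed maximum
```

When the largest value is unique, a resample of size 10 contains it with probability 1 - 0.9^10, about 65%, so roughly a third of the resamples fall below the observed maximum and none rise above it.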

You can also report confidence intervals (there may be a better way, but this works):

for (i in 1:12) {
    print(boot.ci(results, type="bca", index=i))
}
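One caveat with the resampling above: it draws rows from all 120 observations at once, so the number of observations per month changes from replicate to replicate, and a month can occasionally be missing altogether. A possible variant, sketched here as an assumption rather than as part of the original answer, is to resample within each month through the strata argument of boot() and to report percentile intervals. The names month_f, month_max_strat, strat_results, and ci, as well as the seed, are introduced for illustration only.

```r
library(boot)

set.seed(123)  # arbitrary seed, for reproducibility only
counts <- data.frame(year  = rep(2000:2009, each = 12),
                     month = rep(month.abb, 10),
                     count = sample(1:500, 120, replace = TRUE))

# Factor with calendar-ordered levels, so results run Jan, Feb, ..., Dec
month_f <- factor(counts$month, levels = month.abb)

# Monthly maxima of the resampled rows
month_max_strat <- function(data, indices) {
    d <- data[indices, ]
    tapply(d$count, factor(d$month, levels = month.abb), max)
}

# strata = month_f makes boot() resample rows within each month
strat_results <- boot(data = counts, statistic = month_max_strat,
                      R = 1000, strata = month_f)

# 95% percentile intervals, one row per month
ci <- t(sapply(1:12, function(i) {
    boot.ci(strat_results, type = "perc", index = i)$percent[4:5]
}))
dimnames(ci) <- list(month.abb, c("lower", "upper"))
print(cbind(max = as.vector(strat_results$t0), ci))
```

The upper limits of these intervals cannot exceed the observed maxima, since no resample can, so they describe the variability of the resampled maxima rather than bracketing a larger "true" maximum.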
