
How to determine which distribution fits my data best?

Cross Validated. Tags: r, distributions, goodness-of-fit, kolmogorov-smirnov-test, distribution-identification

2022-02-04 11:59:50

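I have a dataset and would like to figure out which distribution fits my data best.

I used the fitdistr() function to estimate the parameters needed to describe the assumed distribution (e.g. Weibull, Cauchy, normal). With those parameters I can run a Kolmogorov-Smirnov test to assess whether my sample comes from the assumed distribution.

If the p-value is > 0.05, I can assume that the sample is drawn from that distribution. But the p-value doesn't provide any information about the goodness of fit, does it?

So if the p-value of my sample is > 0.05 for both a normal and a Weibull distribution, how can I know which distribution fits my data better?

This is basically what I have done: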

> mydata
 [1] 37.50 46.79 48.30 46.04 43.40 39.25 38.49 49.51 40.38 36.98 40.00
[12] 38.49 37.74 47.92 44.53 44.91 44.91 40.00 41.51 47.92 36.98 43.40
[23] 42.26 41.89 38.87 43.02 39.25 40.38 42.64 36.98 44.15 44.91 43.40
[34] 49.81 38.87 40.00 52.45 53.13 47.92 52.45 44.91 29.54 27.13 35.60
[45] 45.34 43.37 54.15 42.77 42.88 44.26 27.14 39.31 24.80 16.62 30.30
[56] 36.39 28.60 28.53 35.84 31.10 34.55 52.65 48.81 43.42 52.49 38.00
[67] 38.65 34.54 37.70 38.11 43.05 29.95 32.48 24.63 35.33 41.34

# estimate shape and scale to perform KS-test for weibull distribution
> fitdistr(mydata, "weibull")
     shape        scale   
   6.4632971   43.2474500 
 ( 0.5800149) ( 0.8073102)

# KS-test for weibull distribution
> ks.test(mydata, "pweibull", scale=43.2474500, shape=6.4632971)

        One-sample Kolmogorov-Smirnov test

data:  mydata
D = 0.0686, p-value = 0.8669
alternative hypothesis: two-sided

# KS-test for normal distribution
> ks.test(mydata, "pnorm", mean=mean(mydata), sd=sd(mydata))

        One-sample Kolmogorov-Smirnov test

data:  mydata
D = 0.0912, p-value = 0.5522
alternative hypothesis: two-sided

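The p-values are 0.8669 for the Weibull distribution and 0.5522 for the normal distribution. So I can assume that my data are compatible with a Weibull as well as a normal distribution. But which distribution function describes my data better?

Following a suggestion by the user elevendollar, I found the following code, but I don't know how to interpret the result: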

fits <- list(no = fitdistr(mydata, "normal"),
             we = fitdistr(mydata, "weibull"))
sapply(fits, function(i) i$loglik)
       no        we 
-259.6540 -257.9268

2 Answers

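First, here are some quick comments:

The $p$-values of a Kolmogorov-Smirnov test (KS test) with estimated parameters can be quite wrong, because the p-value does not take the uncertainty of the estimation into account. So unfortunately, you can't just fit a distribution and then plug the estimated parameters into a Kolmogorov-Smirnov test of the same sample. There is a normality test called the Lilliefors test, a modified version of the KS test that allows for estimated parameters.
Your sample will never follow a specific distribution exactly. So even if the $p$-values from the KS test were valid and $>0.05$, that would only mean that you can't rule out that your data follow this specific distribution. Another formulation would be that your sample is compatible with a certain distribution. But the answer to the question "Does my data follow the distribution xy exactly?" is always no.
The goal here cannot be to determine with certainty which distribution your sample follows. The goal is what @whuber (in the comments) calls a parsimonious approximate description of the data. A specific parametric distribution can be useful as a model of the data (just as the model "the Earth is a sphere" can be useful).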
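The first comment can be checked with a small simulation. The following is only a sketch, not part of the original answer: it draws normal samples (so the null hypothesis is true), plugs the estimated mean and standard deviation into ks.test(), and records how often the naive p-value falls below 0.05. The sample size of 76 is chosen to match the data in the question; the number of replications and the seed are arbitrary assumptions.

```r
# Sketch: behaviour of the naive KS p-value when parameters are
# estimated from the same sample that is being tested.
set.seed(1)      # arbitrary seed, for reproducibility only
n      <- 76     # same size as the sample in the question
n_reps <- 2000   # arbitrary number of replications

p_naive <- replicate(n_reps, {
  y <- rnorm(n)  # the null hypothesis (normality) is true by construction
  ks.test(y, "pnorm", mean = mean(y), sd = sd(y))$p.value
})

# A valid test would reject in about 5% of the replications.
rejection_rate <- mean(p_naive < 0.05)
print(rejection_rate)
```

If the naive p-values were valid, the rejection rate would be close to 0.05. In this setting it is typically much lower, which is the sense in which the plug-in p-value cannot be trusted.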

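But let's do some exploration. I will use the excellent fitdistrplus package, which offers some nice functions for distribution fitting. We will use the function descdist to get some ideas about possible candidate distributions.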

library(fitdistrplus)
library(logspline)

x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

Now let's use descdist:

descdist(x, discrete = FALSE)

[Figure: descdist plot of kurtosis against squared skewness]

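The kurtosis and squared skewness of your sample are plotted as a blue point labelled "Observation". It seems that possible distributions include the Weibull, the lognormal and possibly the gamma distribution.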
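To make the two plotted quantities concrete, here is a minimal base-R sketch that computes the squared skewness and the kurtosis of the sample. It uses the plain moment estimators (dividing by n), which is an assumption of this sketch; descdist may apply a bias correction, so its plotted point can differ slightly.

```r
# Sketch: the two summary statistics shown on the descdist plot,
# computed with simple moment estimators (an assumption of this sketch).
x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
       38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
       42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
       49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
       45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
       36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
       38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

m  <- mean(x)
m2 <- mean((x - m)^2)   # second central moment
m3 <- mean((x - m)^3)   # third central moment
m4 <- mean((x - m)^4)   # fourth central moment

skew <- m3 / m2^1.5     # skewness
kurt <- m4 / m2^2       # kurtosis (equals 3 for a normal distribution)

print(c(squared_skewness = skew^2, kurtosis = kurt))
```

A normal distribution sits at squared skewness 0 and kurtosis 3, so the distance of the observed point from that location gives a first, rough impression of how non-normal the sample is.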

Let's fit a Weibull distribution and a normal distribution:

fit.weibull <- fitdist(x, "weibull")
fit.norm <- fitdist(x, "norm")

Now inspect the fit for the normal:

plot(fit.norm)

[Figure: diagnostic plots for the normal fit]

And for the Weibull fit:

plot(fit.weibull)

[Figure: diagnostic plots for the Weibull fit]

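Both look good, but judged by the QQ-plot, the Weibull maybe looks a bit better, especially in the tails. Correspondingly, the AIC of the Weibull fit is lower than that of the normal fit: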

fit.weibull$aic
[1] 519.8537

fit.norm$aic
[1] 523.3079
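These AIC values are tied directly to the log-likelihoods quoted in the question, which may help in interpreting them. As a small sketch: with $k$ estimated parameters, $\mathrm{AIC} = 2k - 2\log L$, and both candidate models have $k = 2$ here (mean and sd for the normal, shape and scale for the Weibull). The variable names below are illustrative only.

```r
# Sketch: recover the AIC values from the log-likelihoods in the question.
loglik <- c(normal = -259.6540, weibull = -257.9268)
k      <- 2                      # parameters per model (same for both)

aic   <- 2 * k - 2 * loglik      # AIC = 2k - 2 log L
delta <- aic - min(aic)          # difference to the best-scoring model

print(round(aic, 3))             # normal: 523.308, weibull: 519.854
print(round(delta, 3))           # normal: 3.454,   weibull: 0
```

Because both models have the same number of parameters, ranking them by AIC is the same as ranking them by log-likelihood: the higher log-likelihood (Weibull) gives the lower AIC.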

Kolmogorov-Smirnov test simulation

I will use @Aksakal's procedure, explained here, to simulate the KS statistic under the null hypothesis.

n.sims <- 5e4

stats <- replicate(n.sims, {      
  r <- rweibull(n = length(x)
                , shape= fit.weibull$estimate["shape"]
                , scale = fit.weibull$estimate["scale"]
  )
  estfit.weibull <- fitdist(r, "weibull") # added to account for the estimated parameters
  as.numeric(ks.test(r
                     , "pweibull"
                     , shape= estfit.weibull$estimate["shape"]
                     , scale = estfit.weibull$estimate["scale"])$statistic
  )      
})

The ECDF of the simulated KS statistics looks as follows:

plot(ecdf(stats), las = 1, main = "KS-test statistic simulation (CDF)", col = "darkorange", lwd = 1.7)
grid()

[Figure: ECDF of the simulated KS statistics]

Finally, our $p$-value using the simulated null distribution of the KS statistic is:

fit <- logspline(stats)

1 - plogspline(ks.test(x
                       , "pweibull"
                       , shape= fit.weibull$estimate["shape"]
                       , scale = fit.weibull$estimate["scale"])$statistic
               , fit
)

[1] 0.4889511

This confirms our graphical conclusion that the sample is compatible with a Weibull distribution.
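As a cross-check that does not need a smoothed density, the p-value can also be taken as the share of simulated statistics at least as large as the observed one. The sketch below is a swapped-in variant, not the answer's code: it uses MASS::fitdistr (as in the question) instead of fitdistrplus, only 500 simulations to keep it quick, and an arbitrary seed. All object names are illustrative.

```r
# Sketch: empirical (counting-based) p-value for the KS statistic,
# re-estimating the Weibull parameters in every simulated sample.
library(MASS)   # provides fitdistr()

x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
       38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
       42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
       49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
       45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
       36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
       38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

# KS statistic of a sample against a Weibull with the given estimates.
ks_stat <- function(y, est) {
  suppressWarnings(   # silences the ties warning for the observed data
    ks.test(y, "pweibull", shape = est[["shape"]], scale = est[["scale"]])$statistic
  )
}

set.seed(42)          # arbitrary seed
est_obs <- suppressWarnings(fitdistr(x, "weibull"))$estimate
d_obs   <- ks_stat(x, est_obs)

n_sims <- 500         # far fewer than the 5e4 used above, for speed
d_sim <- replicate(n_sims, {
  y   <- rweibull(length(x), shape = est_obs[["shape"]], scale = est_obs[["scale"]])
  est <- suppressWarnings(fitdistr(y, "weibull"))$estimate  # refit each sample
  ks_stat(y, est)
})

# Share of simulated statistics at least as extreme as the observed one;
# the +1 terms keep the estimate away from exactly zero.
p_emp <- (1 + sum(d_sim >= d_obs)) / (n_sims + 1)
print(p_emp)
```

With so few simulations the result is noisy, but it should land in the same region as the logspline-based value above.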

As explained here, we can use bootstrapping to add pointwise confidence intervals to the estimated Weibull PDF or CDF:

xs <- seq(10, 65, len=500)

true.weibull <- rweibull(1e6, shape= fit.weibull$estimate["shape"]
                         , scale = fit.weibull$estimate["scale"])

boot.pdf <- sapply(1:1000, function(i) {
  xi <- sample(x, size=length(x), replace=TRUE)
  MLE.est <- suppressWarnings(fitdist(xi, distr="weibull"))  
  dweibull(xs, shape=MLE.est$estimate["shape"],  scale = MLE.est$estimate["scale"])
}
)

boot.cdf <- sapply(1:1000, function(i) {
  xi <- sample(x, size=length(x), replace=TRUE)
  MLE.est <- suppressWarnings(fitdist(xi, distr="weibull"))  
  pweibull(xs, shape= MLE.est$estimate["shape"],  scale = MLE.est$estimate["scale"])
}
)   

#-----------------------------------------------------------------------------
# Plot PDF
#-----------------------------------------------------------------------------

par(bg="white", las=1, cex=1.2)
plot(xs, boot.pdf[, 1], type="l", col=rgb(.6, .6, .6, .1), ylim=range(boot.pdf),
     xlab="x", ylab="Probability density")
for(i in 2:ncol(boot.pdf)) lines(xs, boot.pdf[, i], col=rgb(.6, .6, .6, .1))

# Add pointwise confidence bands

quants <- apply(boot.pdf, 1, quantile, c(0.025, 0.5, 0.975))
min.point <- apply(boot.pdf, 1, min, na.rm=TRUE)
max.point <- apply(boot.pdf, 1, max, na.rm=TRUE)
lines(xs, quants[1, ], col="red", lwd=1.5, lty=2)
lines(xs, quants[3, ], col="red", lwd=1.5, lty=2)
lines(xs, quants[2, ], col="darkred", lwd=2)

[Figure: bootstrapped pointwise confidence bands for the Weibull density]

#-----------------------------------------------------------------------------
# Plot CDF
#-----------------------------------------------------------------------------

par(bg="white", las=1, cex=1.2)
plot(xs, boot.cdf[, 1], type="l", col=rgb(.6, .6, .6, .1), ylim=range(boot.cdf),
     xlab="x", ylab="F(x)")
for(i in 2:ncol(boot.cdf)) lines(xs, boot.cdf[, i], col=rgb(.6, .6, .6, .1))

# Add pointwise confidence bands

quants <- apply(boot.cdf, 1, quantile, c(0.025, 0.5, 0.975))
min.point <- apply(boot.cdf, 1, min, na.rm=TRUE)
max.point <- apply(boot.cdf, 1, max, na.rm=TRUE)
lines(xs, quants[1, ], col="red", lwd=1.5, lty=2)
lines(xs, quants[3, ], col="red", lwd=1.5, lty=2)
lines(xs, quants[2, ], col="darkred", lwd=2)
#lines(xs, min.point, col="purple")
#lines(xs, max.point, col="purple")

[Figure: bootstrapped pointwise confidence bands for the Weibull CDF]

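Automatic distribution fitting with GAMLSS

The gamlss package for R offers the ability to try many different distributions and select the "best" one according to the GAIC (the generalized Akaike information criterion). The main function is fitDist. An important option in this function is the type of the distributions that are tried. For example, setting type = "realline" will try all implemented distributions defined on the whole real line, whereas type = "realplus" will only try distributions defined on the positive real line. Another important option is the parameter $k$, which is the penalty for the GAIC. In the example below, I set $k = 2$, which means that the "best" distribution is selected according to the classic AIC. You can set $k$ to anything you like, such as $\log(n)$ for the BIC.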

library(gamlss)
library(gamlss.dist)
library(gamlss.add)

x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
       38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
       42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
       49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
       45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
       36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
       38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

fit <- fitDist(x, k = 2, type = "realplus", trace = FALSE, try.gamlss = TRUE)

summary(fit)

*******************************************************************
Family:  c("WEI2", "Weibull type 2") 

Call:  gamlssML(formula = y, family = DIST[i], data = sys.parent()) 

Fitting method: "nlminb" 


Coefficient(s):
             Estimate  Std. Error  t value   Pr(>|t|)    
eta.mu    -24.3468041   2.2141197 -10.9962 < 2.22e-16 ***
eta.sigma   1.8661380   0.0892799  20.9021 < 2.22e-16 ***

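According to the AIC, the Weibull distribution (more specifically WEI2, a special parametrization of it) fits the data best. The exact parametrization of the distribution WEI2 is detailed in this document on page 279. Let's inspect the fit by looking at the residuals in a worm plot (basically a de-trended Q-Q plot):

We expect the residuals to be close to the middle horizontal line, with 95% of them lying between the upper and lower dotted curves, which act as 95% pointwise confidence intervals. In this case, the worm plot looks fine to me, indicating that the Weibull distribution is an adequate fit.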

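Plotting is mostly a good way to get a better idea of what your data look like. In your case I would recommend plotting the empirical cumulative distribution function (ecdf) against the theoretical cdfs, using the parameters you got from fitdistr().

I did that once for my data and also included the confidence intervals. Here is the picture I got using ggplot2().

[Figure: empirical CDF with fitted theoretical CDFs]

The black line is the empirical cumulative distribution function and the colored lines are cdfs of different distributions, with parameters obtained by maximum likelihood. One can easily see that the exponential and the normal distribution are not a good fit to the data, because their lines have a different shape than the ecdf and lie quite far away from it. Unfortunately the other distributions are quite close. But I would say that the logNormal line is the closest to the black line. Using a measure of distance (for example MSE) one could validate this assumption.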
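The distance idea can be sketched for the data in the question as follows. This is only an illustration under stated assumptions: the MSE is evaluated at the distinct observed values, the candidates are limited to the normal and the Weibull, and the object names are made up for this sketch.

```r
# Sketch: mean squared distance between the ECDF and two fitted CDFs.
library(MASS)   # provides fitdistr()

x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
       38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
       42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
       49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
       45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
       36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
       38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

# Maximum likelihood fits of the two candidate distributions.
est_n <- fitdistr(x, "normal")$estimate
est_w <- suppressWarnings(fitdistr(x, "weibull"))$estimate

Fn   <- ecdf(x)            # empirical CDF as a function
grid <- sort(unique(x))    # evaluation points (a choice made for this sketch)

mse <- c(
  normal  = mean((Fn(grid) - pnorm(grid, mean = est_n[["mean"]], sd = est_n[["sd"]]))^2),
  weibull = mean((Fn(grid) - pweibull(grid, shape = est_w[["shape"]], scale = est_w[["scale"]]))^2)
)
print(mse)
```

To reproduce the kind of picture described above in base graphics, plot(Fn) followed by curve(pnorm(x, est_n[["mean"]], est_n[["sd"]]), add = TRUE) overlays a fitted CDF on the ECDF.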

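If you only have two competing distributions (for example, the two that seem to fit best in the plot), you could use a likelihood-ratio test to test which distribution fits better.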
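One caveat when sketching this for the normal and the Weibull: they are not nested, so the classical chi-squared likelihood-ratio test does not apply directly. The sketch below therefore swaps in a Vuong-style statistic, a likelihood-ratio comparison designed for non-nested models, which is an assumption of this example and not something the answer specifies. Both models have two parameters, so no complexity correction is applied.

```r
# Sketch: Vuong-style likelihood-ratio comparison of two non-nested fits.
library(MASS)   # provides fitdistr()

x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
       38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
       42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
       49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
       45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
       36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
       38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

est_n <- fitdistr(x, "normal")$estimate
est_w <- suppressWarnings(fitdistr(x, "weibull"))$estimate

# Per-observation log-likelihood contributions under each model.
ll_n <- dnorm(x, mean = est_n[["mean"]], sd = est_n[["sd"]], log = TRUE)
ll_w <- dweibull(x, shape = est_w[["shape"]], scale = est_w[["scale"]], log = TRUE)

d <- ll_w - ll_n                            # positive values favour the Weibull
z <- sqrt(length(x)) * mean(d) / sd(d)      # Vuong-style z statistic
p_two_sided <- 2 * pnorm(-abs(z))           # normal approximation

print(c(loglik_difference = sum(d), z = z, p = p_two_sided))
```

The sum of d is the difference of the two log-likelihoods shown in the question. A large p-value here would mean that the data do not clearly distinguish the two models, even if one of them has the higher likelihood.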
