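First, here are some quick comments:
- The p-values of a Kolmogorov-Smirnov test (KS test) with estimated parameters can be quite wrong, because the p-value does not take the uncertainty of the estimation into account. So unfortunately, you can't just fit a distribution and then plug the estimated parameters into a Kolmogorov-Smirnov test of the same sample. There is a normality test called the Lilliefors test, a modified version of the KS test that allows for estimated parameters.
- Your sample will never follow a specific distribution exactly. So even if the p-values from the KS test were valid and > 0.05, that would only mean you can't rule out that your data follow this specific distribution. Another way to put it is that your sample is compatible with a certain distribution. But the answer to the question "Do my data follow the distribution xy exactly?" is always no.
- The goal here cannot be to determine with certainty what distribution your sample follows. The goal is what @whuber (in the comments) calls a parsimonious approximate description of the data. A specific parametric distribution can be useful as a model of the data (just as the model "the earth is a sphere" can be useful).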
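To make the first point concrete, here is a small simulation sketch of my own (base R only; the sample size, number of replicates and seed are arbitrary choices). It draws samples that really are normal, estimates the mean and sd from each sample, and feeds those estimates straight into `ks.test`. If the naive p-values were valid, about 5% of them would fall below 0.05.

```r
# Sketch: size of the naive KS test when mean and sd are estimated
# from the same sample that is being tested.
set.seed(1)          # arbitrary seed, for reproducibility only
n_obs  <- 50         # illustrative sample size
n_sims <- 2000       # illustrative number of replicates

p_naive <- replicate(n_sims, {
  y <- rnorm(n_obs)  # the null hypothesis is true by construction
  ks.test(y, "pnorm", mean = mean(y), sd = sd(y))$p.value
})

# A valid test would reject in roughly 5% of the replicates.
reject_rate <- mean(p_naive < 0.05)
reject_rate

hist(p_naive, breaks = 20,
     main = "Naive KS p-values under the null", xlab = "p-value")
```

The histogram should be visibly skewed towards large p-values rather than flat, which is why the naive test almost never rejects. For the normal case specifically, the `nortest` package provides `lillie.test()` as an implementation of the Lilliefors test.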
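But let's do some exploration. I will use the excellent fitdistrplus package, which offers some nice functions for distribution fitting. We will use the function descdist to get some ideas about possible candidate distributions.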
library(fitdistrplus)
library(logspline)
x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)
Now let's use descdist:
descdist(x, discrete = FALSE)
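The kurtosis and squared skewness of your sample are plotted as a blue point named "Observation". Possible distributions seem to include the Weibull, the lognormal and possibly the gamma distribution.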
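For reference, the two coordinates of that blue point can be computed by hand. The sketch below uses plain moment estimators in base R; as far as I know, descdist applies a small-sample correction by default, so its numbers may differ slightly. The fallback sample is a simulated stand-in (an assumption of mine) that is only used when `x` is not already in the workspace.

```r
# Stand-in data, used only if `x` from above is not available (assumption).
if (!exists("x")) { set.seed(1); x <- rweibull(76, shape = 6, scale = 43) }

# Central moments of the sample
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
m4 <- mean((x - mean(x))^4)

skewness <- m3 / m2^1.5   # 0 for any symmetric distribution
kurtosis <- m4 / m2^2     # Pearson kurtosis, 3 for a normal distribution

c(squared_skewness = skewness^2, kurtosis = kurtosis)
```

On the skewness-kurtosis plot the normal distribution sits at (0, 3) and the exponential at (4, 9), which gives a feel for where the observation lies relative to the candidates.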
Let's fit a Weibull distribution and a normal distribution:
fit.weibull <- fitdist(x, "weibull")
fit.norm <- fitdist(x, "norm")
Now inspect the fit for the normal:
plot(fit.norm)
And for the Weibull fit:
plot(fit.weibull)
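Both look good, but judging by the QQ-plot, the Weibull maybe looks a bit better, especially in the tails. Correspondingly, the AIC of the Weibull fit is lower than that of the normal fit: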
fit.weibull$aic
[1] 519.8537
fit.norm$aic
[1] 523.3079
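The AIC is just minus twice the maximised log-likelihood plus twice the number of parameters. As a sanity check, here is a base-R sketch for the normal fit, where the maximum-likelihood estimates have a closed form. With the data above, the result should closely match `fit.norm$aic`; the fallback sample is again a simulated stand-in of my own.

```r
# Stand-in data, used only if `x` from above is not available (assumption).
if (!exists("x")) { set.seed(1); x <- rweibull(76, shape = 6, scale = 43) }

# ML estimates for the normal: sample mean and the sd with divisor n (not n - 1)
mu_hat    <- mean(x)
sigma_hat <- sqrt(mean((x - mu_hat)^2))

loglik_norm <- sum(dnorm(x, mean = mu_hat, sd = sigma_hat, log = TRUE))
n_par       <- 2                                  # mean and sd
aic_norm    <- -2 * loglik_norm + 2 * n_par
aic_norm
```

Both models have two parameters here, so the AIC difference of about 3.5 in favour of the Weibull comes entirely from the difference in log-likelihood.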
Kolmogorov-Smirnov test simulation
I will use @Aksakal's procedure explained here to simulate the KS statistic under the null.
n.sims <- 5e4
stats <- replicate(n.sims, {
r <- rweibull(n = length(x)
, shape= fit.weibull$estimate["shape"]
, scale = fit.weibull$estimate["scale"]
)
estfit.weibull <- fitdist(r, "weibull") # added to account for the estimated parameters
as.numeric(ks.test(r
, "pweibull"
, shape= estfit.weibull$estimate["shape"]
, scale = estfit.weibull$estimate["scale"])$statistic
)
})
The ECDF of the simulated KS statistics looks as follows:
plot(ecdf(stats), las = 1, main = "KS-test statistic simulation (CDF)", col = "darkorange", lwd = 1.7)
grid()
Finally, our p-value using the simulated null distribution of the KS statistic is:
fit <- logspline(stats)
1 - plogspline(ks.test(x
, "pweibull"
, shape= fit.weibull$estimate["shape"]
, scale = fit.weibull$estimate["scale"])$statistic
, fit
)
[1] 0.4889511
This confirms our graphical conclusion that the sample is compatible with a Weibull distribution.
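The logspline smoothing is a convenience, not a requirement: the share of simulated statistics at least as large as the observed one is already a Monte Carlo p-value. Below is a self-contained sketch of that variant. The helper `ks_refit`, the seed and the reduced number of simulations are my own choices for illustration, and the fallback sample is a simulated stand-in.

```r
library(fitdistrplus)

# Stand-in data, used only if `x` from above is not available (assumption).
if (!exists("x")) { set.seed(1); x <- rweibull(76, shape = 6, scale = 43) }

set.seed(42)                                   # arbitrary seed
fit_w     <- fitdist(x, "weibull")
shape_hat <- fit_w$estimate[["shape"]]
scale_hat <- fit_w$estimate[["scale"]]

# KS statistic of a sample against a Weibull refitted to that same sample
# (hypothetical helper, not part of fitdistrplus)
ks_refit <- function(y) {
  f <- suppressWarnings(fitdist(y, "weibull"))
  ks <- suppressWarnings(ks.test(y, "pweibull",
                                 shape = f$estimate[["shape"]],
                                 scale = f$estimate[["scale"]]))
  as.numeric(ks$statistic)
}

n_sims_demo <- 500                             # far fewer than 5e4, to keep it quick
stats_demo  <- replicate(
  n_sims_demo,
  ks_refit(rweibull(length(x), shape = shape_hat, scale = scale_hat))
)

ks_obs <- ks_refit(x)

# Add-one correction keeps the estimate away from exactly zero
p_mc <- (1 + sum(stats_demo >= ks_obs)) / (n_sims_demo + 1)
p_mc
```

With only 500 replicates the result is noisy, so expect it to agree with the logspline-based value only roughly.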
As explained here, we can use bootstrapping to add pointwise confidence intervals to the estimated Weibull PDF or CDF:
xs <- seq(10, 65, len=500)
true.weibull <- rweibull(1e6, shape= fit.weibull$estimate["shape"]
, scale = fit.weibull$estimate["scale"])
boot.pdf <- sapply(1:1000, function(i) {
xi <- sample(x, size=length(x), replace=TRUE)
MLE.est <- suppressWarnings(fitdist(xi, distr="weibull"))
dweibull(xs, shape=MLE.est$estimate["shape"], scale = MLE.est$estimate["scale"])
}
)
boot.cdf <- sapply(1:1000, function(i) {
xi <- sample(x, size=length(x), replace=TRUE)
MLE.est <- suppressWarnings(fitdist(xi, distr="weibull"))
pweibull(xs, shape= MLE.est$estimate["shape"], scale = MLE.est$estimate["scale"])
}
)
#-----------------------------------------------------------------------------
# Plot PDF
#-----------------------------------------------------------------------------
par(bg="white", las=1, cex=1.2)
plot(xs, boot.pdf[, 1], type="l", col=rgb(.6, .6, .6, .1), ylim=range(boot.pdf),
xlab="x", ylab="Probability density")
for(i in 2:ncol(boot.pdf)) lines(xs, boot.pdf[, i], col=rgb(.6, .6, .6, .1))
# Add pointwise confidence bands
quants <- apply(boot.pdf, 1, quantile, c(0.025, 0.5, 0.975))
min.point <- apply(boot.pdf, 1, min, na.rm=TRUE)
max.point <- apply(boot.pdf, 1, max, na.rm=TRUE)
lines(xs, quants[1, ], col="red", lwd=1.5, lty=2)
lines(xs, quants[3, ], col="red", lwd=1.5, lty=2)
lines(xs, quants[2, ], col="darkred", lwd=2)
#-----------------------------------------------------------------------------
# Plot CDF
#-----------------------------------------------------------------------------
par(bg="white", las=1, cex=1.2)
plot(xs, boot.cdf[, 1], type="l", col=rgb(.6, .6, .6, .1), ylim=range(boot.cdf),
xlab="x", ylab="F(x)")
for(i in 2:ncol(boot.cdf)) lines(xs, boot.cdf[, i], col=rgb(.6, .6, .6, .1))
# Add pointwise confidence bands
quants <- apply(boot.cdf, 1, quantile, c(0.025, 0.5, 0.975))
min.point <- apply(boot.cdf, 1, min, na.rm=TRUE)
max.point <- apply(boot.cdf, 1, max, na.rm=TRUE)
lines(xs, quants[1, ], col="red", lwd=1.5, lty=2)
lines(xs, quants[3, ], col="red", lwd=1.5, lty=2)
lines(xs, quants[2, ], col="darkred", lwd=2)
#lines(xs, min.point, col="purple")
#lines(xs, max.point, col="purple")
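If intervals for the parameters themselves are enough, fitdistrplus has a built-in bootstrap, `bootdist`, that saves writing the resampling loop by hand. A minimal sketch, assuming a nonparametric bootstrap with only 200 resamples to keep it quick (the seed is arbitrary and the fallback sample is a simulated stand-in):

```r
library(fitdistrplus)

# Stand-in data, used only if `x` from above is not available (assumption).
if (!exists("x")) { set.seed(1); x <- rweibull(76, shape = 6, scale = 43) }

set.seed(123)                                  # arbitrary seed
fit_w  <- fitdist(x, "weibull")
boot_w <- suppressWarnings(
  bootdist(fit_w, bootmethod = "nonparam", niter = 200)
)

summary(boot_w)                                # medians and 95% percentile intervals for shape and scale
quantile(boot_w, probs = c(0.05, 0.5, 0.95))   # bootstrap intervals for selected quantiles
```

This summarises the uncertainty in shape and scale rather than drawing bands around the curve, so it complements the plots above instead of replacing them.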
Automatic distribution fitting using GAMLSS
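The gamlss package for R offers the ability to try many different distributions and select the "best" one according to the GAIC (the generalized Akaike information criterion). The main function is fitDist. An important option in this function is the type of distributions that are tried. For example, setting type = "realline" will try all implemented distributions defined on the whole real line, whereas type = "realplus" will only try distributions defined on the positive real line. Another important option is the parameter k, which is the penalty for the GAIC. In the example below, I set k = 2, which means that the "best" distribution is selected according to the classic AIC. You can set k to anything you like, such as log(n) for the BIC.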
library(gamlss)
library(gamlss.dist)
library(gamlss.add)
x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)
fit <- fitDist(x, k = 2, type = "realplus", trace = FALSE, try.gamlss = TRUE)
summary(fit)
*******************************************************************
Family: c("WEI2", "Weibull type 2")
Call: gamlssML(formula = y, family = DIST[i], data = sys.parent())
Fitting method: "nlminb"
Coefficient(s):
Estimate Std. Error t value Pr(>|t|)
eta.mu -24.3468041 2.2141197 -10.9962 < 2.22e-16 ***
eta.sigma 1.8661380 0.0892799 20.9021 < 2.22e-16 ***
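According to the AIC, the Weibull distribution (more specifically WEI2, a special parametrization of it) fits the data best. The exact parametrization of the WEI2 distribution is detailed in this document on page 279. Let's inspect the fit by looking at the residuals in a worm plot (basically a de-trended Q-Q plot).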
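We expect the residuals to be close to the middle horizontal line, with 95% of them lying between the upper and lower dotted curves, which act as 95% pointwise confidence intervals. In this case, the worm plot looks fine to me, indicating that the Weibull distribution is an adequate fit.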
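gamlss provides `wp()` for drawing worm plots of fitted models. To show what such a plot contains, here is a hand-built version in base R, a rough sketch rather than a reimplementation of `wp()`. It assumes a Weibull fitted with `MASS::fitdistr` (used only because MASS ships with R) and an approximate pointwise band based on the standard error of normal order statistics. The fallback sample is a simulated stand-in.

```r
library(MASS)   # ships with R; used here only for a quick Weibull MLE

# Stand-in data, used only if `x` from above is not available (assumption).
if (!exists("x")) { set.seed(1); x <- rweibull(76, shape = 6, scale = 43) }

w <- suppressWarnings(fitdistr(x, "weibull"))

# Normalised quantile residuals: roughly standard normal if the Weibull fits
res <- qnorm(pweibull(x, shape = w$estimate[["shape"]],
                         scale = w$estimate[["scale"]]))

n_obs <- length(res)
p     <- ppoints(n_obs)
z     <- qnorm(p)            # theoretical standard normal quantiles
dev   <- sort(res) - z       # deviation from the QQ line: the "worm"

# Approximate standard error of each order statistic of a normal sample
se <- sqrt(p * (1 - p) / n_obs) / dnorm(z)

plot(z, dev, pch = 16,
     ylim = range(c(dev, -1.96 * se, 1.96 * se)),
     xlab = "Unit normal quantile", ylab = "Deviation")
abline(h = 0)
lines(z,  1.96 * se, lty = 2)   # upper 95% pointwise band
lines(z, -1.96 * se, lty = 2)   # lower 95% pointwise band
```

Points that wander systematically outside the dashed curves, or a worm with a clear slope or U-shape, would point to a misfit in location, scale or shape.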