机器算法验证 - 如何在一个图中绘制拟合图和伽马分布的实际图？ - 吾爱随笔录

如何在一个图中绘制拟合图和伽马分布的实际图？

机器算法验证 r 数理统计拟合优度伽马分布 ggplot2

2022-03-06 19:22:44

加载所需的包。

library(ggplot2)
library(MASS)

生成 10,000 个适合 gamma 分布的数字。

x <- round(rgamma(100000,shape = 2,rate = 0.2),1)
x <- x[which(x>0)]

画出概率密度函数，假设我们不知道 x 适合哪个分布。

t1 <- as.data.frame(table(x))
names(t1) <- c("x","y")
t1 <- transform(t1,x=as.numeric(as.character(x)))
t1$y <- t1$y/sum(t1[,2])
ggplot() + 
  geom_point(data = t1,aes(x = x,y = y)) + 
  theme_classic()

从图中可以看出x的分布很像gamma分布，所以我们用fitdistr()in packageMASS来获取gamma分布的shape和rate参数。

fitdistr(x,"gamma") 
##       output 
##       shape           rate    
##   2.0108224880   0.2011198260 
##  (0.0083543575) (0.0009483429)

在同一个图中绘制实际点（黑点）和拟合图（红线），这是问题，请先看图。

ggplot() + 
  geom_point(data = t1,aes(x = x,y = y)) +     
  geom_line(aes(x=t1[,1],y=dgamma(t1[,1],2,0.2)),color="red") + 
  theme_classic()

我有两个问题：

真正的参数是shape=2, rate=0.2，我使用函数fitdistr()获取的参数是shape=2.01, rate=0.20。这两个几乎一样，但是为什么拟合图不能很好地拟合实际点，肯定是拟合图有问题，或者我绘制拟合图和实际点的方式完全错误，我该怎么办?
在我得到我建立的模型的参数后，我以哪种方式评估模型，比如线性模型的 RSS（残差平方和），或者的 p 值shapiro.test()，ks.test()以及其他测试？

我的统计知识很差，你能帮帮我吗？

ps：我在谷歌、stackoverflow和CV中搜索了很多次，但没有找到与这个问题相关的内容

2个回答

问题 1

手动计算密度的方式似乎是错误的。无需对伽马分布中的随机数进行四舍五入。正如@Pascal 所指出的，您可以使用直方图来绘制点的密度。在下面的示例中，我使用该函数density来估计密度并将其绘制为点。我用点和直方图呈现拟合：

library(ggplot2)
library(MASS)

# Generate gamma rvs

x <- rgamma(100000, shape = 2, rate = 0.2)

den <- density(x)

dat <- data.frame(x = den$x, y = den$y)

# Plot density as points

ggplot(data = dat, aes(x = x, y = y)) + 
  geom_point(size = 3) +
  theme_classic()

# Fit parameters (to avoid errors, set lower bounds to zero)

fit.params <- fitdistr(x, "gamma", lower = c(0, 0))

# Plot using density points

ggplot(data = dat, aes(x = x,y = y)) + 
  geom_point(size = 3) +     
  geom_line(aes(x=dat$x, y=dgamma(dat$x,fit.params$estimate["shape"], fit.params$estimate["rate"])), color="red", size = 1) + 
  theme_classic()

# Plot using histograms

ggplot(data = dat) +
  geom_histogram(data = as.data.frame(x), aes(x=x, y=..density..)) +
  geom_line(aes(x=dat$x, y=dgamma(dat$x,fit.params$estimate["shape"], fit.params$estimate["rate"])), color="red", size = 1) + 
  theme_classic()

这是@Pascal 提供的解决方案：

h <- hist(x, 1000, plot = FALSE)
t1 <- data.frame(x = h$mids, y = h$density)

ggplot(data = t1, aes(x = x, y = y)) + 
  geom_point(size = 3) +     
  geom_line(aes(x=t1$x, y=dgamma(t1$x,fit.params$estimate["shape"], fit.params$estimate["rate"])), color="red", size = 1) + 
  theme_classic()

问题2

为了评估合身度，我推荐了这个包fitdistrplus。以下是如何使用它来拟合两个分布并以图形和数字方式比较它们的拟合。该命令gofstat打印出几个度量，例如 AIC、BIC 和一些 gof-statistics，例如 KS-Test 等。这些主要用于比较不同分布的拟合（在本例中为 gamma 与 Weibull）。更多信息可以在我的回答中找到：

library(fitdistrplus)

x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
       38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
       42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
       49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
       45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
       36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
       38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

fit.weibull <- fitdist(x, "weibull")
fit.gamma <- fitdist(x, "gamma", lower = c(0, 0))

# Compare fits 

graphically

par(mfrow = c(2, 2))
plot.legend <- c("Weibull", "Gamma")
denscomp(list(fit.weibull, fit.gamma), fitcol = c("red", "blue"), legendtext = plot.legend)
qqcomp(list(fit.weibull, fit.gamma), fitcol = c("red", "blue"), legendtext = plot.legend)
cdfcomp(list(fit.weibull, fit.gamma), fitcol = c("red", "blue"), legendtext = plot.legend)
ppcomp(list(fit.weibull, fit.gamma), fitcol = c("red", "blue"), legendtext = plot.legend)

@NickCox 正确地建议 QQ-Plot（右上图）是判断和比较拟合的最佳单一图表。拟合密度很难比较。为了完整起见，我也包括了其他图形。

# Compare goodness of fit

gofstat(list(fit.weibull, fit.gamma))

Goodness-of-fit statistics
                             1-mle-weibull 2-mle-gamma
Kolmogorov-Smirnov statistic    0.06863193   0.1204876
Cramer-von Mises statistic      0.05673634   0.2060789
Anderson-Darling statistic      0.38619340   1.2031051

Goodness-of-fit criteria
                               1-mle-weibull 2-mle-gamma
Aikake's Information Criterion      519.8537    531.5180
Bayesian Information Criterion      524.5151    536.1795

除了上面提供的答案，但使用密度并添加一些主题金光闪闪：

library(fitdistrplus)
library(ggplot2)
library(ggthemes)

fit.gamma <- fitdist(df$y, "gamma", lower=c(0,0), start=list(scale=1,shape=1))
gammas<-round(rgamma(nrow(df), shape = fit.gamma$estimate["shape"], scale = fit.gamma$estimate["scale"]), 1)

gammas %>%
  as_tibble() %>%
  ggplot(aes(value)) +
  geom_histogram(stat = "density") +
  stat_function(fun = function(x) {dgamma(x, fit.gamma$estimate["shape"],scale=fit.gamma$estimate["scale"])}, color = "red") +
  lims(x = c(0, 15)) +
  theme_economist() +
  scale_color_economist()

其它你可能感兴趣的问题

上一篇Fisher 检验的假设“行和列的总数应该是固定的”是什么意思？下一篇为什么在使用 MAP 估计参数时需要 MCMC