机器算法验证 - 什么是比较分布的良好数据可视化技术？ - 吾爱随笔录

什么是比较分布的良好数据可视化技术？

机器算法验证 r 分布数据可视化箱形图相对分布

2022-01-16 10:57:40

我正在写我的博士论文，我意识到我过度依赖箱线图来比较分布。您还喜欢哪些其他选择来完成这项任务？

我还想问你是否知道任何其他资源，例如 R 画廊，我可以在其中激发自己关于数据可视化的不同想法。

4个回答

正如@gung 所建议的那样，我将详细说明我的评论。为了完整起见，我还将包括@Alexander 建议的小提琴情节。其中一些工具可用于比较两个以上的样本。

# Required packages

library(sn)
library(aplpack)
library(vioplot)
library(moments)
library(beanplot)

# Simulate from a normal and skew-normal distributions
x = rnorm(250,0,1)
y = rsn(250,0,1,5)

# Separated histograms
hist(x)
hist(y)

# Combined histograms
hist(x, xlim=c(-4,4),ylim=c(0,1), col="red",probability=T)
hist(y, add=T, col="blue",probability=T)

# Boxplots
boxplot(x,y)

# Separated smoothed densities
plot(density(x))
plot(density(y))

# Combined smoothed densities
plot(density(x),type="l",col="red",ylim=c(0,1),xlim=c(-4,4))
points(density(y),type="l",col="blue")

# Stem-and-leaf plots
stem(x)
stem(y)

# Back-to-back stem-and-leaf plots
stem.leaf.backback(x,y)

# Violin plot (suggested by Alexander)
vioplot(x,y)

# QQ-plot
qqplot(x,y,xlim=c(-4,4),ylim=c(-4,4))
qqline(x,y,col="red")

# Kolmogorov-Smirnov test
ks.test(x,y)

# six-numbers summary
summary(x)
summary(y)

# moment-based summary
c(mean(x),var(x),skewness(x),kurtosis(x))
c(mean(y),var(y),skewness(y),kurtosis(y))

# Empirical ROC curve
xx = c(-Inf, sort(unique(c(x,y))), Inf)
sens = sapply(xx, function(t){mean(x >= t)})
spec = sapply(xx, function(t){mean(y < t)})

plot(0, 0, xlim = c(0, 1), ylim = c(0, 1), type = 'l')
segments(0, 0, 1, 1, col = 1)
lines(1 - spec, sens, type = 'l', col = 2, lwd = 1)

# Beanplots
beanplot(x,y)

# Empirical CDF
plot(ecdf(x))
lines(ecdf(y))

我希望这有帮助。

在对您的建议进行了更多探索之后，我发现这种情节可以补充@Procastinator 的答案。它被称为“蜂群”，是箱形图和小提琴图的混合，其细节级别与散点图相同。

蜂群 R 包

蜂群图示例

这是来自 Nathan Yau 的 Flowing Data 博客的一个很好的教程，它使用 R 和美国的州级犯罪数据。表明：

盒须图（您已经使用过）
直方图
核密度图
地毯地块
小提琴情节
Bean Plots（箱形图、密度图的奇怪组合，中间有一块地毯）。

最近，我发现自己绘制 CDF 比绘制直方图要多得多。

一张纸条：

您想回答有关数据的问题，而不是创建有关可视化方法本身的问题。通常，无聊更好。它确实使比较的比较也更容易理解。

答案：

除了 R 的基本包之外，对简单格式的需求可能解释了 Hadley 的 ggplot 包在 R 中的流行。

library(sn)
library(ggplot2)

# Simulate from a normal and skew-normal distributions
x = rnorm(250,0,1)
y = rsn(250,0,1,5)


##============================================================================
## I put the data into a data frame for ease of use
##============================================================================

dat = data.frame(x,y=y[1:250]) ## y[1:250] is used to remove attributes of y
str(dat)
dat = stack(dat)
str(dat)

##============================================================================
## Density plots with ggplot2
##============================================================================
ggplot(dat, 
     aes(x=values, fill=ind, y=..scaled..)) +
        geom_density() +
        opts(title = "Some Example Densities") +
        opts(plot.title = theme_text(size = 20, colour = "Black"))

ggplot(dat, 
     aes(x=values, fill=ind, y=..scaled..)) +
        geom_density() +
        facet_grid(ind ~ .) +
        opts(title = "Some Example Densities \n Faceted") +
        opts(plot.title = theme_text(size = 20, colour = "Black"))

ggplot(dat, 
     aes(x=values, fill=ind)) +
        geom_density() +
        facet_grid(ind ~ .) +
        opts(title = "Some Densities \n This time without \"scaled\" ") +
        opts(plot.title = theme_text(size = 20, colour = "Black"))

##----------------------------------------------------------------------------
## You can do histograms in ggplot2 as well...
## but I don't think that you can get all the good stats 
## in a table, as with hist
## e.g. stats = hist(x)
##----------------------------------------------------------------------------
ggplot(dat, 
     aes(x=values, fill=ind)) +
        geom_histogram(binwidth=.1) +
        facet_grid(ind ~ .) +
        opts(title = "Some Example Histograms \n Faceted") +
        opts(plot.title = theme_text(size = 20, colour = "Black"))

## Note, I put in code to mimic the default "30 bins" setting
ggplot(dat, 
     aes(x=values, fill=ind)) +
        geom_histogram(binwidth=diff(range(dat$values))/30) +
        opts(title = "Some Example Histograms") +
        opts(plot.title = theme_text(size = 20, colour = "Black"))

最后，我发现添加一个简单的背景会有所帮助。这就是为什么我写了可以由 panel.first 调用的“bgfun”

bgfun = function (color="honeydew2", linecolor="grey45", addgridlines=TRUE) {
    tmp = par("usr")
    rect(tmp[1], tmp[3], tmp[2], tmp[4], col = color)
    if (addgridlines) {
        ylimits = par()$usr[c(3, 4)]
        abline(h = pretty(ylimits, 10), lty = 2, col = linecolor)
    }
}
plot(rnorm(100), panel.first=bgfun())

## Plot with original example data
op = par(mfcol=c(2,1))
hist(x, panel.first=bgfun(), col='antiquewhite1', main='Bases belonging to us')
hist(y, panel.first=bgfun(color='darkolivegreen2'), 
    col='antiquewhite2', main='Bases not belonging to us')
mtext( 'all your base are belong to us', 1, 4)
par(op)

其它你可能感兴趣的问题

上一篇（某些）伪随机化有什么问题下一篇R中的列矩阵归一化