QQ plot and $x=y$ line

Tags: r, qq-plot

2022-03-25 17:22:10

If my Q-Q plot is linear but its slope differs from that of the 45-degree line, what does this suggest?

I am trying to examine the fit of a Laplace distribution to my sample data, so I randomly generated Laplace-distributed observations (with parameters estimated from my sample) and plotted them against my sample:

```
qqplot(rand, sample)
abline(0, 1, col = 'red')
```

2 Answers

Since your question does not include data, my answer below uses the Gaussian distribution vs. a sample (instead of the Laplace distribution vs. your sample data).

As far as the first two moments are concerned, what you see in the Q-Q plot can be interpreted as follows:

If the distributions are identical, you expect the line $x = y$:
```
x <- rnorm(1000)
qqnorm(x)
abline(0, 1, col = 'red')
```

If the means are different, you expect an intercept $a \neq 0$, so the points lie above or below the $x=y$ line:
```
x <- rnorm(1000)
qqnorm(x + 1)
abline(0, 1, col = 'red')
```

If the standard deviations are different, you expect a slope $b \neq 1$:
```
x <- rnorm(1000)
qqnorm(x * 1.5)
abline(0, 1, col = 'red')
```
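
The three cases above can be checked numerically by fitting a straight line through the Q-Q points. The following is only a sketch: the shift of 2, the scale of 1.5 and the seed are arbitrary values chosen for illustration, not taken from the question.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rnorm(1000)
y <- 2 + 1.5 * x                 # hypothetical sample: mean shifted by 2, sd scaled by 1.5

# qqnorm() returns the plotted coordinates:
# theoretical quantiles in $x, sample quantiles in $y
qq <- qqnorm(y, plot.it = FALSE)

# Least-squares line through the Q-Q points
fit <- lm(qq$y ~ qq$x)
coef(fit)                        # intercept should be close to 2, slope close to 1.5
```

The fitted intercept approximates the sample mean and the fitted slope approximates the sample standard deviation, which is why a shifted or rescaled normal sample still gives a straight line, just not the $x=y$ line.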

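To build intuition for this, you can plot both CDFs on the same axes. For example, taking the last case:

```
plot(ecdf(x * 1.5), main = 'CDF')  # empirical CDF of the sample from the last example
lines(seq(-7, 7, by = 0.01), pnorm(seq(-7, 7, by = 0.01)), col = 'red')  # standard normal CDF
```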

Let's take, for example, three points on the y-axis, $CDF(q) = 0.2$, $0.5$ and $0.8$, and look at the quantile $q$ at which each curve reaches that value.

You can see that:

$\begin{aligned} F^{-1}_{red}(0.2) &> F^{-1}_X(0.2) \text{ (quantile around -1)} \\ F^{-1}_{red}(0.5) &= F^{-1}_X(0.5) \text{ (quantile = 0)}\\ F^{-1}_{red}(0.8) &< F^{-1}_X(0.8) \text{ (quantile around 1)} \end{aligned}$

This is exactly what the Q-Q plot shows.
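
The same comparison can be made with numbers instead of reading them off the CDF plot. This sketch assumes theoretical quantiles (via `qnorm`) in place of the empirical CDF of the simulated sample, so the values are exact rather than estimates.

```r
p <- c(0.2, 0.5, 0.8)

q_red <- qnorm(p)              # quantiles of the standard normal (the red curve)
q_x   <- qnorm(p, sd = 1.5)    # quantiles of a normal with sd = 1.5 (the scaled sample)

# Side-by-side table: q_x is q_red stretched by a factor of 1.5 around 0
data.frame(p, q_red, q_x)
```

Below the median the red quantile is the larger one, above the median it is the smaller one, and the two agree at the median.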

The linearity of the Q-Q plot only suggests that your sample follows a normal distribution (or, more specifically, that its quantile function is an affine transformation of the probit function). The slope is determined by the standard deviation (for sd = 1, we get the popular $x=y$ line).
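
A short derivation may make the role of the standard deviation explicit. The symbols $\mu$, $\sigma$ and $\Phi$ are introduced here for illustration: assume $X \sim N(\mu, \sigma^2)$ and let $\Phi$ denote the standard normal CDF, whose inverse $\Phi^{-1}$ is the probit function.

$$
\begin{aligned}
F_X(q) &= \Phi\!\left(\frac{q - \mu}{\sigma}\right) \\
F_X^{-1}(p) &= \mu + \sigma\, \Phi^{-1}(p)
\end{aligned}
$$

Plotting $F_X^{-1}(p)$ against $\Phi^{-1}(p)$ therefore gives a straight line with intercept $\mu$ and slope $\sigma$, which reduces to $x=y$ when $\mu = 0$ and $\sigma = 1$.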

An S-shaped plot, that is, one that looks the same after a 180-degree rotation, is indicative of a symmetric distribution.

An intuitive explanation for the shape is as follows. To get a straight line, the spacing of the quantiles around the mean must scale in the same way throughout the distribution. That is, if the $x^{th}$ quantile lies at some proportion of the distance from the mean compared to the $y^{th}$ quantile, that proportion must be conserved, which happens only for a normal distribution. The slope reflects the absolute magnitude of these distances and hence depends on the sd. Other shapes can be reasoned out in a similar way, by looking at this proportion at different places along the distribution.
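
The conserved proportion can be illustrated with theoretical quantiles. In this sketch the probabilities 0.75 and 0.90, the list of sds and the t-distribution with 5 degrees of freedom are arbitrary choices for illustration.

```r
p <- c(0.75, 0.90)

# Ratio of the distances of two quantiles from the centre (0 here),
# for normal distributions with different sds
ratio_normal <- sapply(c(0.5, 1, 2, 4), function(s) {
  q <- qnorm(p, mean = 0, sd = s)
  q[2] / q[1]
})
ratio_normal                      # the same value for every sd

# The same ratio for a t-distribution with 5 degrees of freedom (heavier tails)
q_t <- qt(p, df = 5)
ratio_t <- q_t[2] / q_t[1]
ratio_t                           # larger than the normal ratio
```

Changing the sd rescales every distance by the same factor, so the ratio stays fixed and only the slope changes. A heavier-tailed distribution changes the ratio itself, which shows up as curvature in a normal Q-Q plot.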

Here are some visualisations.

Note: I am plotting the sample on the y-axis, as is the norm, and I am assuming that the way you have plotted puts the sample on the x-axis.

R code:

```
# Creating different distributions with mean 0 (each standardised to sd 1)
library(rmutil)   # provides rlaplace()
set.seed(12345)
normald <- rnorm(10000, sd = 2)
normald <- (normald - mean(normald)) / sd(normald)
sharperpeak <- rlaplace(10000)   # using Laplace distribution
sharperpeak <- (sharperpeak - mean(sharperpeak)) / sd(sharperpeak)
heavytail <- rt(10000, 5)        # using t-distribution
heavytail <- (heavytail - mean(heavytail)) / sd(heavytail)
positiveskew <- rlnorm(10000)    # using lognormal distribution
positiveskew <- (positiveskew - mean(positiveskew)) / sd(positiveskew)
negativeskew <- positiveskew * (-1)   # shortcut
negativeskew <- (negativeskew - mean(negativeskew)) / sd(negativeskew)

library(ggplot2)
library(gridExtra)

# In each density plot the red vertical lines mark the 25% and 75% quantiles.
# Note that color and alpha are arguments of geom_vline(), not of quantile().

# normal plot
p1 <- ggplot(data.frame(dt = normald)) +
  geom_density(aes(x = dt), fill = 'green', alpha = 0.6) +
  xlab('Normal Distribution') +
  geom_vline(xintercept = quantile(normald, c(0.25, 0.75)), color = 'red', alpha = 0.3)
p2 <- ggplot(data.frame(dt = normald)) +
  geom_qq(aes(sample = dt)) +
  geom_abline(slope = 1, intercept = 0)
grid.arrange(p1, p2, nrow = 1)

# sharppeak plot
p1 <- ggplot(data.frame(dt = sharperpeak)) +
  geom_density(aes(x = dt), fill = 'green', alpha = 0.6) +
  xlab('Sharper-peaks') +
  geom_vline(xintercept = quantile(sharperpeak, c(0.25, 0.75)), color = 'red', alpha = 0.3)
p2 <- ggplot(data.frame(dt = sharperpeak)) +
  geom_qq(aes(sample = dt)) +
  geom_abline(slope = 1, intercept = 0)
grid.arrange(p1, p2, nrow = 1)

# heaviertails plot
p1 <- ggplot(data.frame(dt = heavytail)) +
  geom_density(aes(x = dt), fill = 'green', alpha = 0.6) +
  xlab('Heavy Tails') +
  geom_vline(xintercept = quantile(heavytail, c(0.25, 0.75)), color = 'red', alpha = 0.3)
p2 <- ggplot(data.frame(dt = heavytail)) +
  geom_qq(aes(sample = dt)) +
  geom_abline(slope = 1, intercept = 0)
grid.arrange(p1, p2, nrow = 1)

# positiveskew plot
p1 <- ggplot(data.frame(dt = positiveskew)) +
  geom_density(aes(x = dt), fill = 'green', alpha = 0.6) +
  xlab('Positively skewed Distribution') +
  geom_vline(xintercept = quantile(positiveskew, c(0.25, 0.75)), color = 'red', alpha = 0.3) +
  xlim(-1.5, 5)
p2 <- ggplot(data.frame(dt = positiveskew)) +
  geom_qq(aes(sample = dt)) +
  geom_abline(slope = 1, intercept = 0)
grid.arrange(p1, p2, nrow = 1)

# negative skew plot
p1 <- ggplot(data.frame(dt = negativeskew)) +
  geom_density(aes(x = dt), fill = 'green', alpha = 0.6) +
  xlab('Negatively skewed Distribution') +
  geom_vline(xintercept = quantile(negativeskew, c(0.25, 0.75)), color = 'red', alpha = 0.3) +
  xlim(-5, 1.5)
p2 <- ggplot(data.frame(dt = negativeskew)) +
  geom_qq(aes(sample = dt)) +
  geom_abline(slope = 1, intercept = 0)
grid.arrange(p1, p2, nrow = 1)

# Normal distributions with different sds
normal1 <- rnorm(3000, sd = 2)
normal2 <- rnorm(3000, sd = 4)
normal3 <- rnorm(3000, sd = 0.5)
normal4 <- rnorm(3000, sd = 0.25)
final <- c(normal1, normal2, normal3, normal4)
ggplot(data.frame(dt = final,
                  sds = factor(rep(c('2', '4', '0.5', '0.25'), each = 3000))),
       aes(sample = dt, color = sds)) +
  geom_qq() +
  geom_abline(slope = 1, intercept = 0)
```
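
The same reasoning carries over to the Laplace fit asked about in the question. The sketch below plots the sample against theoretical Laplace quantiles rather than against a second random sample, which removes one source of noise. Everything in it is an assumption made for illustration: `qlaplace_hyp` is a hypothetical helper written here in base R (it is not a function from `rmutil`), `my_sample` is simulated stand-in data, and the parameters are estimated by maximum likelihood (median for location, mean absolute deviation from the median for scale).

```r
# Hypothetical helper (not from any package): quantile function of Laplace(m, s)
qlaplace_hyp <- function(p, m = 0, s = 1) {
  m - s * sign(p - 0.5) * log(1 - 2 * abs(p - 0.5))
}

set.seed(42)   # arbitrary seed
# Stand-in for the real data: Laplace(3, 2) draws via inverse-transform sampling
my_sample <- qlaplace_hyp(runif(2000), m = 3, s = 2)

# Maximum-likelihood estimates for a Laplace distribution
m_hat <- median(my_sample)
s_hat <- mean(abs(my_sample - m_hat))

# Theoretical quantiles of the fitted Laplace at the usual plotting positions
theo <- qlaplace_hyp(ppoints(length(my_sample)), m_hat, s_hat)

plot(theo, sort(my_sample),
     xlab = 'Fitted Laplace quantiles', ylab = 'Sample quantiles')
abline(0, 1, col = 'red')
```

If the points are linear but the slope differs from 1, the scale parameter is off. A non-zero intercept points to the location parameter, and curvature suggests the Laplace shape itself does not fit.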
