QQ plot and $x = y$ line

2022-03-25 17:22:10

If my QQ plot is linear but its gradient is not the same as that of the 45-degree line, what does this suggest?

I am trying to examine how well a Laplace distribution fits my sample data, so I randomly generated Laplace-distributed observations (with parameters estimated from my sample) and plotted them against my sample:

qqplot(rand, sample)
abline(0, 1, col = 'red')
2 Answers

Since your question includes no data, I use a Gaussian distribution versus a simulated sample in my answer below (instead of a Laplace distribution versus your sample data).

As far as the first two moments are concerned, the interpretation of what you see in the QQ plot is the following:

  • If the distributions are identical, you expect the line $x = y$

    x <- rnorm(1000)
    qqnorm(x)
    abline(0, 1, col = 'red')
    


  • If the means are different, you expect an intercept $a \neq 0$, meaning the points lie above or below the $x = y$ line

    x <- rnorm(1000)
    qqnorm(x + 1)
    abline(0, 1, col = 'red')
    


  • If the standard deviations are different, you expect a slope $b \neq 1$

    x <- rnorm(1000)
    qqnorm(x * 1.5)
    abline(0, 1, col = 'red')
    

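
The three cases above can be checked numerically. The following is a minimal sketch, not part of the original answer: it assumes a simulated sample shifted by 1 and scaled by 1.5 (the seed and the names `qq` and `fit` are illustrative), and fits a straight line through the QQ points to read off the intercept and slope.

```r
# Sketch: recover the intercept and slope of a normal QQ plot.
set.seed(1)                        # arbitrary seed, only for reproducibility
x <- rnorm(1000)                   # standard normal draws
y <- 1 + 1.5 * x                   # mean shifted by 1, sd scaled by 1.5

# qq$x holds the theoretical N(0, 1) quantiles, qq$y the sample values
qq <- qqnorm(y, plot.it = FALSE)

# Least-squares line through the QQ points (illustrative choice of fit)
fit <- lm(qq$y ~ qq$x)
coefs <- unname(coef(fit))         # c(intercept, slope)
print(round(coefs, 2))
```

Under these assumptions the intercept should come out close to 1 (the mean shift) and the slope close to 1.5 (the ratio of standard deviations), up to sampling noise.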

To get the intuition for this, you can plot both CDFs in the same figure. For example, taking the last case:

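plot(ecdf(x * 1.5), do.points = FALSE)  # empirical CDF of the sample with sd 1.5
lines(seq(-7, 7, by = 0.01), pnorm(seq(-7, 7, by = 0.01)), col = 'red')  # standard normal CDF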

Let's take, for example, 3 points on the y-axis, $CDF(q) = 0.2, 0.5, 0.8$, and read off the corresponding quantile $q$ on each curve.

You can see that:

$F^{-1}_{red}(0.2) > F^{-1}_{X}(0.2)$ (quantile around -1)

$F^{-1}_{red}(0.5) = F^{-1}_{X}(0.5)$ (quantile = 0)

$F^{-1}_{red}(0.8) < F^{-1}_{X}(0.8)$ (quantile around 1)

This is exactly what the QQ plot shows.

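
The same comparison can be made with exact quantiles. This is a sketch under one assumption that is not in the original answer: the sample `x * 1.5` is replaced by its theoretical distribution, a normal with mean 0 and sd 1.5, so that no random draws are needed (the names `q_red` and `q_x` are illustrative).

```r
# Sketch: compare quantiles of the red curve N(0, 1) with those of N(0, 1.5^2).
p <- c(0.2, 0.5, 0.8)              # the three CDF levels used above

q_red <- qnorm(p)                  # quantiles of the standard normal (red curve)
q_x   <- qnorm(p, sd = 1.5)        # quantiles of the wider distribution

print(round(cbind(p, q_red, q_x), 2))

# Each quantile of the wider distribution is 1.5 times the standard normal
# one, which is why the QQ points fall on a line with slope 1.5.
print(isTRUE(all.equal(q_x, 1.5 * q_red)))
```

At p = 0.2 the standard normal quantile is about -0.84 against about -1.26 for the wider distribution; at p = 0.5 both are 0; at p = 0.8 the signs flip.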

The linearity of the QQ plot only suggests that your sample follows a normal distribution (more specifically, that its quantile function is the probit function up to a shift and scale). The slope is determined by the standard deviation (for mean 0 and sd = 1, we get the familiar $x = y$ line).

An S-shaped plot, that is, one that looks unchanged under a 180-degree rotation, is indicative of a symmetric distribution.

An intuitive explanation for the shape is as follows. To get a straight line, the spacing of the sample quantiles around the mean must scale in the same way as the spacing of the reference quantiles. That is, if the xth quantile lies at some proportion of the distance from the mean compared with the yth quantile, that proportion must be the same in both distributions, which holds only when the sample is normal. The slope reflects the absolute magnitude of this spacing and hence depends on the sd. Other shapes can be reasoned out in the same way, by looking at this proportion at different places along the distribution.
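
The proportion argument can be illustrated with theoretical quantiles. The sketch below is an illustration rather than the answerer's own code: it assumes a normal with sd 2 and a t-distribution with 5 degrees of freedom as examples (both centred at 0, so each quantile is also its distance from the mean), and divides their quantiles by the standard normal ones.

```r
# Sketch: ratio of quantile distances from the centre, relative to N(0, 1).
p   <- c(0.6, 0.75, 0.9, 0.99)     # probability levels moving into the tail
ref <- qnorm(p)                    # standard normal reference quantiles

# Normal with sd 2: the ratio is the same at every level (straight QQ line)
ratio_normal <- qnorm(p, mean = 0, sd = 2) / ref

# t with 5 df: the ratio grows towards the tail (QQ plot bends away)
ratio_t <- qt(p, df = 5) / ref

print(round(rbind(ratio_normal, ratio_t), 3))
```

A constant ratio corresponds to a straight line whose slope is that ratio, while a ratio that changes along the distribution corresponds to curvature in the QQ plot.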

Here are some visualisations.

Note: I am plotting the sample on the y-axis, as is the norm, and I am assuming that your plot puts the sample on the x-axis.

[Figures: density and normal QQ plots for the normal, sharper-peaked, heavy-tailed, positively skewed and negatively skewed samples, followed by QQ plots of normal samples with different standard deviations]

R-code:

# Creating different distributions, each standardised to mean 0 and sd 1
library(rmutil) # provides rlaplace()
set.seed(12345)
normald<-rnorm(10000,sd=2)
normald<-(normald-mean(normald))/sd(normald)
sharperpeak<-rlaplace(10000) #using Laplace distribution
sharperpeak<-(sharperpeak-mean(sharperpeak))/sd(sharperpeak)
heavytail<-rt(10000,5) #using t-distribution
heavytail<-(heavytail-mean(heavytail))/sd(heavytail)
positiveskew<-rlnorm(10000) #using lognormal distribution
positiveskew<-(positiveskew-mean(positiveskew))/sd(positiveskew)
negativeskew<-positiveskew*(-1) #shortcut
negativeskew<-(negativeskew-mean(negativeskew))/sd(negativeskew)

library(ggplot2)
library(gridExtra)

#normal plot
p1<-ggplot(data.frame(dt=normald))+geom_density(aes(x=dt),fill='green',alpha=0.6)+xlab('Normal Distribution')+geom_vline(xintercept=quantile(normald,c(0.25,0.75)),color='red',alpha=0.3)
p2<-ggplot(data.frame(dt=normald))+geom_qq(aes(sample=dt))+geom_abline(slope=1,intercept = 0)
grid.arrange(p1,p2,nrow=1)

#sharppeak plot
p1<-ggplot(data.frame(dt=sharperpeak))+geom_density(aes(x=dt),fill='green',alpha=0.6)+xlab('Sharper-peaks')+geom_vline(xintercept=quantile(sharperpeak,c(0.25,0.75)),color='red',alpha=0.3)
p2<-ggplot(data.frame(dt=sharperpeak))+geom_qq(aes(sample=dt))+geom_abline(slope=1,intercept = 0)
grid.arrange(p1,p2,nrow=1)

#heaviertails plot
p1<-ggplot(data.frame(dt=heavytail))+geom_density(aes(x=dt),fill='green',alpha=0.6)+xlab('Heavy Tails')+geom_vline(xintercept=quantile(heavytail,c(0.25,0.75)),color='red',alpha=0.3)
p2<-ggplot(data.frame(dt=heavytail))+geom_qq(aes(sample=dt))+geom_abline(slope=1,intercept = 0)
grid.arrange(p1,p2,nrow=1)

#positiveskew plot
p1<-ggplot(data.frame(dt=positiveskew))+geom_density(aes(x=dt),fill='green',alpha=0.6)+xlab('Positively skewed Distribution')+geom_vline(xintercept=quantile(positiveskew,c(0.25,0.75)),color='red',alpha=0.3)+xlim(-1.5,5)
p2<-ggplot(data.frame(dt=positiveskew))+geom_qq(aes(sample=dt))+geom_abline(slope=1,intercept = 0)
grid.arrange(p1,p2,nrow=1)

#negative skew plot
p1<-ggplot(data.frame(dt=negativeskew))+geom_density(aes(x=dt),fill='green',alpha=0.6)+xlab('Negatively skewed Distribution')+geom_vline(xintercept=quantile(negativeskew,c(0.25,0.75)),color='red',alpha=0.3)+xlim(-5,1.5)
p2<-ggplot(data.frame(dt=negativeskew))+geom_qq(aes(sample=dt))+geom_abline(slope=1,intercept = 0)
grid.arrange(p1,p2,nrow=1)

# Normal distributions with different sds
normal1<-rnorm(3000,sd=2)
normal2<-rnorm(3000,sd=4)
normal3<-rnorm(3000,sd=0.5)
normal4<-rnorm(3000,sd=0.25)
final<-c(normal1,normal2,normal3,normal4)
ggplot(data.frame(dt=final,sds=factor(rep(c('2','4','0.5','0.25'),each=3000))),aes(sample=dt,color=sds))+geom_qq()+geom_abline(slope=1,intercept=0)
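
The slopes in the last plot can also be estimated without reading them off the figure. The sketch below is an addition, not part of the original answer: it draws fresh samples with the same four standard deviations (the seed and the names `sds` and `slopes` are assumptions) and estimates each QQ slope as the sample interquartile range divided by the standard normal one, the same quartile-based idea that base R's `qqline()` uses by default.

```r
# Sketch: estimate the QQ-plot slope for normals with different sds.
set.seed(12345)                    # arbitrary seed, only for reproducibility
sds <- c(2, 4, 0.5, 0.25)          # the standard deviations plotted above

slopes <- sapply(sds, function(s) {
  x <- rnorm(3000, sd = s)
  # slope of the line through the first and third quartiles of the QQ plot
  unname(diff(quantile(x, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75))))
})

print(round(cbind(sd = sds, slope = slopes), 2))
```

Each estimated slope should land close to the standard deviation used to generate the sample, which is the pattern the coloured lines in the last plot display.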