Why do people use L(θ|x) for likelihood instead of P(x|θ)?

likelihood notation
2022-01-28 16:05:43

According to the Wikipedia article Likelihood function, the likelihood function is defined as:

L(θ|x)=P(x|θ),

with parameters θ and observed data x. This equals p(x|θ) or p_θ(x), depending on notation and on whether θ is treated as a random variable or as a fixed value.

The notation L(θ|x) seems like an unnecessary abstraction to me. Is there any benefit to using L(θ|x), or could one equivalently use P(x|θ)? Why was L(θ|x) introduced?

4 Answers

Likelihood is a function of θ, given x, while P is a function of x, given θ.

  • The likelihood function is not a density (or pmf) -- it need not integrate (or sum) to 1.

  • Indeed, L may be continuous while P is discrete (e.g. the likelihood for a binomial parameter), or vice versa (e.g. the likelihood for an Erlang distribution with unit rate parameter but unspecified shape).

Imagine a bivariate function of a single potential observation x (say a Poisson count) and a single parameter (e.g. λ) -- in this example discrete in x and continuous in λ. When you slice that bivariate function of (x,λ) one way you get p_λ(x) (each slice gives a different pmf), and when you slice it the other way you get L_x(λ) (each slice a different likelihood function).

(That bivariate function simply expresses the way x and λ are related via your model)
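A quick numerical sketch of that slicing (the Poisson rate and the particular values below are my own illustrative choices, not part of the answer): fixing λ and varying x gives a pmf, while fixing an observed x and varying λ gives a likelihood function.

    import math

    def f(x, lam):
        # the bivariate function of (x, lam): Poisson probability of count x at rate lam
        return math.exp(-lam) * lam**x / math.factorial(x)

    # Slice one way: fix lam = 2.0 and vary x  ->  a pmf p_lam(x)
    pmf_slice = [f(x, 2.0) for x in range(20)]
    print(sum(pmf_slice))      # ~1.0: a genuine probability distribution over x

    # Slice the other way: fix the observation x = 3 and vary lam  ->  L_x(lam)
    lik_slice = [f(3, lam) for lam in (0.5, 1.0, 2.0, 3.0, 5.0)]
    print(lik_slice)           # relative plausibilities of lam; need not sum to 1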

[Alternatively, consider a discrete θ and a continuous x; here the likelihood is discrete and the density continuous.]

As soon as you specify x, you identify a particular L, which we call the likelihood function for that sample. It tells you about θ for that sample -- in particular, which values of θ had more or less likelihood of producing that sample.

Likelihood is a function that tells you about the relative chance that each value of θ could have produced your data (in that ratios of likelihoods can be thought of as ratios of probabilities of the data falling in (x, x+dx)).

According to Bayes' theorem, f(θ|x1,...,xn) = f(x1,...,xn|θ) f(θ) / f(x1,...,xn), that is, posterior = likelihood × prior / evidence.

Notice that the maximum likelihood estimate omits the prior beliefs (or defaults the prior to a zero-mean Gaussian, which then acts as L2 regularization or weight decay) and treats the evidence as a constant (when calculating the partial derivative with respect to θ).

It tries to maximize the likelihood by adjusting θ, in effect treating f(θ|x1,...,xn) as proportional to f(x1,...,xn|θ), which we can easily compute (it is usually the loss), and we write the likelihood as L(θ|x). The true posterior f(x1,...,xn|θ) f(θ) / f(x1,...,xn) can hardly be worked out, because the evidence (the denominator), ∫ f(x1,...,xn, θ) dθ, is intractable.
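Here is a minimal sketch of that idea (the Bernoulli model, the data, and the grid search below are my own illustrative choices): we maximize the log-likelihood directly, and neither the prior f(θ) nor the evidence f(x1,...,xn) appears anywhere in the computation.

    import math

    x = [1, 0, 1, 1, 0, 1]   # hypothetical observed sample

    def log_likelihood(theta, data):
        # log L(theta | x1..xn) = sum_i [ x_i log(theta) + (1 - x_i) log(1 - theta) ]
        return sum(xi * math.log(theta) + (1 - xi) * math.log(1 - theta) for xi in data)

    grid = [i / 1000 for i in range(1, 1000)]          # theta values in (0, 1)
    theta_hat = max(grid, key=lambda t: log_likelihood(t, x))
    print(theta_hat)   # close to the sample mean, 4/6 ≈ 0.667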

Hope this helps.

I agree with @Big Agnes. Here is what my professor taught in class: one way is to think of the likelihood function L(θ|x) as a random function that depends on the data -- different data give different likelihood functions, so you may say it is conditioned on the data. Given a realization of the data, we want to find a θ^ such that L(θ|x) is maximized, or you can say θ^ is most consistent with the data. This is the same as saying we maximize the "observed probability" P(x|θ). We use P(x|θ) to do the calculation, but it is different from P(X|θ): small x stands for the observed values, while X stands for the random variable. If you know θ, then P(x|θ) is the probability/density of observing x.
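To illustrate the point that different data give different likelihood functions, here is a small sketch (the binomial model with n = 10 and the observed counts are hypothetical choices of mine): each realization of the data yields its own likelihood curve in θ, and hence its own maximizer θ^.

    from math import comb

    def likelihood(theta, x, n=10):
        # L(theta | x) = C(n, x) * theta^x * (1 - theta)^(n - x), viewed as a function of theta
        return comb(n, x) * theta**x * (1 - theta)**(n - x)

    grid = [i / 100 for i in range(1, 100)]
    for x_obs in (2, 5, 9):                        # three different hypothetical datasets
        theta_hat = max(grid, key=lambda t: likelihood(t, x_obs))
        print(x_obs, round(theta_hat, 2))          # theta_hat tracks x_obs / n for each dataset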

I think the other answers given by jwyao and Glen_b are quite good. I just wanted to add a very simple example which is too long for a comment.

Consider one observation X from a Bernoulli distribution with probability of success θ. With θ fixed (known or unknown), the distribution of X is given by P(x|θ):

P(x|θ) = θ^x (1-θ)^(1-x).

In other words, we know that P(X=1) = 1 - P(X=0) = θ.

Alternatively, we could treat the observation as fixed and view this as a function of θ:

L(θ|x) = θ^x (1-θ)^(1-x).

In a maximum likelihood setting, we seek the value of θ which maximizes this likelihood as a function of θ. For example, if we observe X=1, then the likelihood becomes

L(θ|x=1) = θ for θ ∈ [0,1] (and 0 otherwise),

and we see that the MLE would be θ^=1.
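A tiny sketch of this example (a grid search in Python, my own illustration): with the single observation x = 1, the likelihood L(θ|x=1) = θ is increasing on [0,1], so the maximizer is 1.

    def L(theta, x):
        # Bernoulli likelihood: theta^x * (1 - theta)^(1 - x)
        return theta**x * (1 - theta)**(1 - x)

    grid = [i / 100 for i in range(101)]
    theta_hat = max(grid, key=lambda t: L(t, 1))
    print(theta_hat)   # 1.0, matching the MLE derived above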

Not sure that I've really added any value to the discussion, but I just wanted to give a simple example of the different ways of viewing the same function.