Why do people use L(θ|x) for likelihood instead of P(x|θ)?

likelihood notation
2022-01-28 16:05:43

According to the Wikipedia article Likelihood function, the likelihood function is defined as:

L(θ|x)=P(x|θ),

with parameters θ and observed data x. This equals p(x|θ) or p_θ(x), depending on notation and on whether θ is treated as a random variable or as a fixed value.

The notation L(θ|x) seems like an unnecessary abstraction to me. Is there any benefit to using L(θ|x), or could one equivalently use P(x|θ)? Why was L(θ|x) introduced?

4 Answers

Likelihood is a function of θ, given x, while P is a function of x, given θ.

  • The likelihood function is not a density (or pmf) -- it need not integrate (or sum) to 1.

  • Indeed, L may be continuous while P is discrete (e.g. the likelihood for a binomial parameter), or vice versa (e.g. the likelihood for an Erlang distribution with unit rate parameter but unspecified shape).

Imagine a bivariate function of a single potential observation x (say a Poisson count) and a single parameter (e.g. λ) -- in this example discrete in x and continuous in λ. When you slice that bivariate function of (x,λ) one way you get p_λ(x) (each slice gives a different pmf), and when you slice it the other way you get L_x(λ) (each slice a different likelihood function).

(That bivariate function simply expresses the way x and λ are related via your model)
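A quick numerical sketch of that slicing (the Poisson rate and the particular values below are my own illustrative choices, not part of the answer): fixing λ and varying x gives a pmf, while fixing an observed x and varying λ gives a likelihood function.

    import math

    def f(x, lam):
        # the bivariate function of (x, lam): Poisson probability of count x at rate lam
        return math.exp(-lam) * lam**x / math.factorial(x)

    # Slice one way: fix lam = 2.0 and vary x  ->  a pmf p_lam(x)
    pmf_slice = [f(x, 2.0) for x in range(20)]
    print(sum(pmf_slice))      # ~1.0: a genuine probability distribution over x

    # Slice the other way: fix the observation x = 3 and vary lam  ->  L_x(lam)
    lik_slice = [f(3, lam) for lam in (0.5, 1.0, 2.0, 3.0, 5.0)]
    print(lik_slice)           # relative plausibilities of lam; need not sum to 1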

[Alternatively, consider a discrete θ and a continuous x; here the likelihood is discrete and the density continuous.]

As soon as you specify x, you identify a particular L, which we call the likelihood function for that sample. It tells you about θ for that sample -- in particular, which values of θ had more or less likelihood of producing that sample.

Likelihood is a function that tells you about the relative chance that each value of θ could have produced your data (in that ratios of likelihoods can be thought of as ratios of probabilities of the data falling in (x, x+dx)).

According to Bayes' theorem, f(θ|x1,...,xn) = f(x1,...,xn|θ) f(θ) / f(x1,...,xn), that is, posterior = likelihood × prior / evidence.

Notice that the maximum likelihood estimate omits the prior beliefs (or defaults the prior to a zero-mean Gaussian, which then acts as L2 regularization or weight decay) and treats the evidence as a constant (when calculating the partial derivative with respect to θ).

It tries to maximize the likelihood by adjusting θ, in effect treating f(θ|x1,...,xn) as proportional to f(x1,...,xn|θ), which we can easily compute (it is usually the loss), and we write the likelihood as L(θ|x). The true posterior f(x1,...,xn|θ) f(θ) / f(x1,...,xn) can hardly be worked out, because the evidence (the denominator), ∫ f(x1,...,xn, θ) dθ, is intractable.
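Here is a minimal sketch of that idea (the Bernoulli model, the data, and the grid search below are my own illustrative choices): we maximize the log-likelihood directly, and neither the prior f(θ) nor the evidence f(x1,...,xn) appears anywhere in the computation.

    import math

    x = [1, 0, 1, 1, 0, 1]   # hypothetical observed sample

    def log_likelihood(theta, data):
        # log L(theta | x1..xn) = sum_i [ x_i log(theta) + (1 - x_i) log(1 - theta) ]
        return sum(xi * math.log(theta) + (1 - xi) * math.log(1 - theta) for xi in data)

    grid = [i / 1000 for i in range(1, 1000)]          # theta values in (0, 1)
    theta_hat = max(grid, key=lambda t: log_likelihood(t, x))
    print(theta_hat)   # close to the sample mean, 4/6 ≈ 0.667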

Hope this helps.

I agree with @Big Agnes. Here is what my professor taught in class: one way is to think of the likelihood function L(θ|x) as a random function that depends on the data -- different data give different likelihood functions, so you may say it is conditioned on the data. Given a realization of the data, we want to find a θ^ such that L(θ|x) is maximized, or you can say θ^ is most consistent with the data. This is the same as saying we maximize the "observed probability" P(x|θ). We use P(x|θ) to do the calculation, but it is different from P(X|θ): small x stands for the observed values, while X stands for the random variable. If you know θ, then P(x|θ) is the probability/density of observing x.
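To illustrate the point that different data give different likelihood functions, here is a small sketch (the binomial model with n = 10 and the observed counts are hypothetical choices of mine): each realization of the data yields its own likelihood curve in θ, and hence its own maximizer θ^.

    from math import comb

    def likelihood(theta, x, n=10):
        # L(theta | x) = C(n, x) * theta^x * (1 - theta)^(n - x), viewed as a function of theta
        return comb(n, x) * theta**x * (1 - theta)**(n - x)

    grid = [i / 100 for i in range(1, 100)]
    for x_obs in (2, 5, 9):                        # three different hypothetical datasets
        theta_hat = max(grid, key=lambda t: likelihood(t, x_obs))
        print(x_obs, round(theta_hat, 2))          # theta_hat tracks x_obs / n for each dataset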

I think the other answers given by jwyao and Glen_b are quite good. I just wanted to add a very simple example which is too long for a comment.

Consider one observation X from a Bernoulli distribution with probability of success θ. With θ fixed (known or unknown), the distribution of X is given by P(x|θ):

P(x|θ) = θ^x (1-θ)^(1-x).

In other words, we know that P(X=1) = 1 - P(X=0) = θ.

Alternatively, we could treat the observation as fixed and view this as a function of θ:

L(θ|x) = θ^x (1-θ)^(1-x).

In a maximum likelihood setting, we seek the value of θ which maximizes this likelihood as a function of θ. For example, if we observe X=1, then the likelihood becomes

L(θ|x=1) = θ for θ ∈ [0,1] (and 0 otherwise),

and we see that the MLE would be θ^=1.
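A tiny sketch of this example (a grid search in Python, my own illustration): with the single observation x = 1, the likelihood L(θ|x=1) = θ is increasing on [0,1], so the maximizer is 1.

    def L(theta, x):
        # Bernoulli likelihood: theta^x * (1 - theta)^(1 - x)
        return theta**x * (1 - theta)**(1 - x)

    grid = [i / 100 for i in range(101)]
    theta_hat = max(grid, key=lambda t: L(t, 1))
    print(theta_hat)   # 1.0, matching the MLE derived above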

Not sure that I've really added any value to the discussion, but I just wanted to give a simple example of the different ways of viewing the same function.