
Mathematics in Machine Learning -- Point Estimation (IV): Maximum A Posteriori (MAP) Estimation

2022-06-24 16:32:00 von Neumann

Series: Mathematics in Machine Learning, general table of contents
Related articles:
· Point Estimation (I): Fundamentals
· Point Estimation (II): Moment Estimation
· Point Estimation (III): Maximum Likelihood Estimation (MLE)
· Point Estimation (IV): Maximum A Posteriori Estimation (MAP)


In the previous article we discussed the frequentist approach: estimate a single value of $\theta$ and then base all predictions on that estimate. An alternative is to consider all possible values of $\theta$ when making predictions; this is the Bayesian approach. The frequentist view is that the true parameter $\theta$ is fixed but unknown, while the point estimate $\hat{\theta}$ is a random variable, since it is a function of the dataset (which is regarded as random).

The Bayesian perspective is entirely different. Bayesian statistics uses probability to reflect degrees of certainty in a state of knowledge. The dataset is directly observed, so it is not random. The true parameter $\theta$, on the other hand, is unknown or uncertain and is therefore represented as a random variable.

Before the data are observed, we express our knowledge about $\theta$ as a prior probability distribution $p(\theta)$. Generally, machine learning practitioners choose a fairly broad (high-entropy) prior to reflect a high degree of uncertainty about $\theta$ before any data are observed. For example, we might assume a priori that $\theta$ is uniformly distributed over a finite interval. Many priors express a preference for "simpler" solutions.

Now suppose we have a set of data samples $x_1, x_2, \dots, x_n$. Combining the data likelihood $p(x_1, x_2, \dots, x_n \mid \theta)$ with the prior via Bayes' rule recovers the effect of the data on our belief about $\theta$:
$$p(\theta \mid x_1, x_2, \dots, x_n) = \frac{p(x_1, x_2, \dots, x_n \mid \theta)\, p(\theta)}{p(x_1, x_2, \dots, x_n)}$$
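
As a concrete illustration of Bayes' rule, the following sketch (not part of the original article) discretizes $\theta$ on a grid for a coin-flip example with a uniform prior; the data, the grid size, and the Bernoulli likelihood are all assumptions made for this example.

```python
# A minimal sketch of Bayes' rule on a discretized parameter grid.
# Assumed setup (not from the article): x_1, ..., x_n are i.i.d. Bernoulli(theta)
# coin flips and the prior p(theta) is uniform on [0, 1].
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])       # observed samples x_1, ..., x_n
theta_grid = np.linspace(0.001, 0.999, 999)  # discretized values of theta

prior = np.ones_like(theta_grid)             # broad (uniform) prior p(theta)
prior /= prior.sum()

# data likelihood p(x_1, ..., x_n | theta) for every theta on the grid
k, n = x.sum(), len(x)
likelihood = theta_grid**k * (1.0 - theta_grid)**(n - k)

# Bayes' rule: posterior is proportional to likelihood * prior; the evidence
# p(x_1, ..., x_n) is the normalizing constant obtained by summing over the grid.
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()

print("posterior mean of theta:", np.sum(theta_grid * posterior))
```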

In the context of Bayesian estimation, the prior typically starts out as a relatively uniform distribution or a high-entropy Gaussian; observing data usually reduces the entropy of the posterior and concentrates it around a few highly likely values of the parameter. Relative to maximum likelihood estimation, Bayesian estimation differs in two important ways:

  • Unlike maximum likelihood, which predicts using a point estimate of $\theta$, the Bayesian approach predicts using the full distribution over $\theta$. For example, after observing $n$ samples, the predictive distribution over the next data sample $x_{n+1}$ is $p(x_{n+1} \mid x_1, x_2, \dots, x_n) = \int p(x_{n+1} \mid \theta)\, p(\theta \mid x_1, x_2, \dots, x_n)\,\mathrm{d}\theta$. Every value of $\theta$ with positive probability density contributes to the prediction of the next sample, with its contribution weighted by the posterior density itself. If we are still quite uncertain about the value of $\theta$ after observing the dataset $x_1, x_2, \dots, x_n$, that uncertainty is carried directly into any prediction we make. In the previous article we saw that the frequentist way of handling the uncertainty of a given point estimate of $\theta$ is to evaluate its variance, which measures how the estimate would vary if the observed data were resampled. The Bayesian answer to the question of how to handle estimation uncertainty is to integrate over it, which tends to protect well against overfitting. This integration is simply an application of the laws of probability, which makes the Bayesian approach easy to justify, whereas the frequentist machinery for constructing an estimator rests on the rather ad hoc decision to summarize all the information in the dataset by a single point estimate. (A small numerical sketch of this predictive integral follows this list.)
  • The prior shifts probability mass density toward the regions of parameter space that it prefers. In practice, the prior often expresses a preference for simpler or smoother models. A common criticism of the Bayesian approach is that the prior is a source of subjective human judgment that affects the predictions.
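
Below is a minimal numerical sketch of the predictive integral mentioned in the first point, again assuming i.i.d. Bernoulli samples and a uniform prior on $\theta$ (illustrative modeling choices, not from the article); the integral is approximated by a sum over a grid of $\theta$ values.

```python
# A minimal sketch of the Bayesian predictive distribution for the next sample,
# assuming i.i.d. Bernoulli(theta) data and a uniform prior on theta.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])
theta = np.linspace(0.001, 0.999, 999)
post = theta**x.sum() * (1 - theta)**(len(x) - x.sum())  # uniform prior: posterior proportional to likelihood
post /= post.sum()

# p(x_{n+1} = 1 | x_1, ..., x_n) = integral of p(x_{n+1}=1 | theta) p(theta | x_1, ..., x_n) dtheta,
# approximated by a sum over the grid; every theta with posterior mass contributes.
p_next = np.sum(theta * post)
print("predictive probability that x_{n+1} = 1:", p_next)
```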

When training data are limited, Bayesian methods usually generalize better, but when the number of training samples is large, they typically incur a high computational cost.

In principle, we should use the full Bayesian posterior distribution over the parameter $\theta$ to make predictions, but a single point estimate is often needed. One common reason for wanting a point estimate is that, for most interesting models, most computations involving the Bayesian posterior are intractable, and a point estimate offers a tractable approximation. Rather than simply returning to maximum likelihood estimation, we can still gain some of the benefit of the Bayesian approach by letting the prior influence the choice of point estimate. One rational way to do this is to choose the maximum a posteriori (MAP) estimate, the point of maximal posterior probability:
$$\theta_{MAP} = \arg\max_{\theta}\, p(\theta \mid x) = \arg\max_{\theta}\, \big[\log p(x \mid \theta) + \log p(\theta)\big]$$
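
To make the formula concrete, the following sketch computes the MAP estimate by maximizing the log-posterior on a grid; the Bernoulli data and the Beta(2, 2) prior are illustrative assumptions, not part of the article.

```python
# A minimal sketch of the MAP formula above on a discretized grid, assuming
# Bernoulli samples and a Beta(2, 2) prior that prefers theta away from the extremes.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])
theta = np.linspace(0.001, 0.999, 999)

log_lik = x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1 - theta)  # log p(x | theta)
log_prior = np.log(theta) + np.log(1 - theta)                               # log Beta(2, 2) density, up to a constant

theta_map = theta[np.argmax(log_lik + log_prior)]  # argmax_theta [log p(x|theta) + log p(theta)]
theta_mle = theta[np.argmax(log_lik)]
print("MLE:", theta_mle, "MAP:", theta_map)
```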

On the right-hand side, $\log p(x \mid \theta)$ corresponds to the standard log-likelihood term, and $\log p(\theta)$ corresponds to the prior distribution. The advantage of MAP Bayesian inference is that it can exploit information carried by the prior that cannot be found in the training data. Relative to maximum likelihood estimation, this additional information helps reduce the variance of the MAP point estimate; however, this advantage comes at the cost of increased bias. Many regularized estimation strategies, such as maximum likelihood learning with weight decay regularization, can be interpreted as approximations to MAP Bayesian inference: the extra term added to the objective function during regularization corresponds to $\log p(\theta)$. Not all regularization penalties correspond to MAP Bayesian inference, though. For example, some regularizers may not be the logarithm of any probability distribution, and others depend on the data, which a prior distribution of course cannot. MAP Bayesian inference provides an intuitive way to design complicated yet interpretable regularization terms; for example, a more complicated penalty term can be derived by using a mixture of Gaussians, rather than a single Gaussian, as the prior.
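
The weight-decay connection mentioned above can be illustrated with a short sketch: for linear regression with Gaussian noise and a zero-mean Gaussian prior on the weights, the MAP estimate has the same closed form as L2-regularized (ridge) least squares. The data, noise variance, and prior variance below are assumptions chosen only for illustration.

```python
# A minimal sketch (under assumed Gaussian noise and a Gaussian prior on the weights)
# of how the MAP estimate for linear regression reduces to weight-decay least squares.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # design matrix
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.5, size=50)   # noisy observations

sigma2 = 0.5**2   # noise variance (assumed known here)
tau2 = 1.0        # prior variance of each weight

# argmax_w [ log p(y | X, w) + log p(w) ] has the closed form below;
# lam = sigma2 / tau2 plays the role of the weight-decay coefficient.
lam = sigma2 / tau2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

w_mle = np.linalg.solve(X.T @ X, X.T @ y)         # maximum likelihood for comparison
print("MLE:", w_mle)
print("MAP (ridge):", w_map)
```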


Copyright notice: this article was written by [von Neumann]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202211603562220.html