Mathematics in Machine Learning -- Point Estimation (IV): Maximum A Posteriori (MAP) Estimation
2022-06-24 16:32:00 【von Neumann】
Series index: Mathematics in Machine Learning (full table of contents)
Related articles:
· Point Estimation (I): Fundamentals
· Point Estimation (II): Moment Estimation
· Point Estimation (III): Maximum Likelihood Estimation (MLE)
· Point Estimation (IV): Maximum A Posteriori (MAP) Estimation
In the previous article we discussed the frequentist approach: estimate a single value of $\theta$, then make all predictions based on that one estimate. Another approach is to consider all possible values of $\theta$ when making a prediction; this belongs to the domain of Bayesian statistics. The frequentist perspective is that the true parameter $\theta$ is a fixed but unknown constant, while the point estimate $\hat{\theta}$ is a random variable, on account of its being a function of the data set (which is itself regarded as random).
The Bayesian perspective is completely different. Bayesian statistics uses probability to reflect degrees of certainty in a state of knowledge. The data set is directly observed, and so is not random; the true parameter $\theta$, on the other hand, is unknown or uncertain, and is therefore represented as a random variable.
Before the data are observed, we express our knowledge of $\theta$ as a prior probability distribution $p(\theta)$. In general, machine learning practitioners choose a fairly broad (high-entropy) prior to reflect a high degree of uncertainty about $\theta$ before any data are observed. For example, we might assume a priori that $\theta$ is uniformly distributed on some finite interval. Many priors instead express a preference for "simpler" solutions.
Now suppose we have a set of data samples $x_1, x_2, \dots, x_n$. Combining the data likelihood $p(x_1, x_2, \dots, x_n \mid \theta)$ with the prior through Bayes' rule recovers the effect of the data on our belief about $\theta$:
$$p(\theta \mid x_1, x_2, \dots, x_n)=\frac{p(x_1, x_2, \dots, x_n \mid \theta)\,p(\theta)}{p(x_1, x_2, \dots, x_n)}$$
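To make the rule concrete, here is a minimal numerical sketch (not from the original series): a Bernoulli likelihood with a flat, high-entropy prior over $\theta$, with the posterior evaluated on a grid. All variable names and data values are illustrative.

```python
import numpy as np

# Posterior over a Bernoulli parameter theta via Bayes' rule, on a grid.
theta = np.linspace(1e-6, 1 - 1e-6, 1000)    # candidate parameter values
prior = np.ones_like(theta)                  # uniform (high-entropy) prior p(theta)

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])    # observed samples x_1, ..., x_n
k, n = data.sum(), data.size

likelihood = theta**k * (1 - theta)**(n - k)          # p(x_1, ..., x_n | theta)
unnormalized = likelihood * prior                     # numerator of Bayes' rule
posterior = unnormalized / np.trapz(unnormalized, theta)  # divide by the evidence

print(theta[np.argmax(posterior)])  # posterior mode; ~ k/n here, since the prior is flat
```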
In the context of Bayesian estimation, the prior typically begins as a relatively uniform distribution or a high-entropy Gaussian; observing the data usually reduces the entropy of the posterior and concentrates it around a few highly likely values of the parameter. Relative to maximum likelihood estimation, Bayesian estimation differs in two important ways:
- Unlike maximum likelihood, which makes predictions using a point estimate of $\theta$, the Bayesian approach makes predictions using the full distribution over $\theta$. For example, after observing $n$ samples, the predicted distribution of the next data sample $x_{n+1}$ is
  $$p(x_{n+1} \mid x_1, x_2, \dots, x_n)=\int p(x_{n+1} \mid \theta)\,p(\theta \mid x_1, x_2, \dots, x_n)\,\mathrm{d}\theta$$
  Every value of $\theta$ with positive posterior density contributes to the prediction of the next sample, with the contribution weighted by the posterior density itself (see the numerical sketch after this list). If, after observing the data set $x_1, x_2, \dots, x_n$, we are still quite uncertain about the value of $\theta$, this uncertainty is carried directly into any prediction we make. As explored in the previous article, the frequentist way of handling the uncertainty of a given point estimate $\hat{\theta}$ is to evaluate its variance, which measures how the estimate would change under resampling of the observed data. The Bayesian answer to the question of how to handle estimation uncertainty is to integrate over it, which tends to protect against overfitting. Integration is, of course, simply an application of the laws of probability, which makes the Bayesian approach easy to justify, whereas frequentist machine learning builds its estimate on rather ad hoc decisions, summarizing all the information in the data set with a single point estimate.
- The prior shifts probability mass density toward regions of parameter space that it prefers. In practice, the prior often expresses a preference for simpler or smoother models. A common criticism of the Bayesian approach is that the prior is a source of subjective human judgment influencing the predictions.
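As an illustration of the predictive integral above, here is a minimal sketch (again not from the original article) that reuses the Bernoulli grid setup from the previous snippet and evaluates the integral numerically; `k` and `n` are the illustrative counts from before.

```python
import numpy as np

# Posterior predictive for the Bernoulli model above:
# p(x_{n+1}=1 | x_1..x_n) = integral of p(x_{n+1}=1 | theta) * p(theta | x_1..x_n) dtheta.
theta = np.linspace(1e-6, 1 - 1e-6, 1000)
k, n = 6, 8                                   # sufficient statistics from the earlier data
posterior = theta**k * (1 - theta)**(n - k)   # flat prior, so posterior is prop. to likelihood
posterior /= np.trapz(posterior, theta)

# p(x_{n+1}=1 | theta) = theta, so the predictive is the posterior mean:
# every theta contributes, weighted by its posterior density.
p_next_one = np.trapz(theta * posterior, theta)
print(p_next_one)  # ~ (k + 1) / (n + 2): Laplace's rule of succession for a flat prior
```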
When training data is limited, Bayesian methods usually generalize better, but they typically incur a high computational cost when the number of training samples is large.
In principle, we should use the full Bayesian posterior distribution over the parameter $\theta$ to make predictions, but a single point estimate is often needed. One common reason for wanting a point estimate is that, for most models of interest, most computations involving the Bayesian posterior are intractable, and a point estimate offers a tractable approximation. Rather than simply returning to maximum likelihood estimation, we can still gain some of the benefit of the Bayesian approach by allowing the prior to influence the choice of the point estimate. One rational way to do this is maximum a posteriori (MAP) estimation. MAP estimation selects the point of maximal posterior probability:
$$\theta_{MAP}=\arg\max_{\theta}\, p(\theta \mid x)=\arg\max_{\theta}\,\left[\log p(x \mid \theta)+\log p(\theta)\right]$$
On the right-hand side, $\log p(x \mid \theta)$ is the standard log-likelihood term, and $\log p(\theta)$ corresponds to the prior distribution. The advantage of MAP Bayesian inference is that it can exploit information brought by the prior, information that cannot be found in the training data. Relative to maximum likelihood estimation, this additional information helps reduce the variance of the MAP point estimate; however, the advantage comes at the price of increased bias. Many regularized estimation strategies, such as maximum likelihood learning regularized with weight decay, can be interpreted as making the MAP approximation to Bayesian inference: the extra term added to the objective function by the regularizer corresponds to $\log p(\theta)$. Not every regularization penalty corresponds to MAP Bayesian inference; for example, some regularizer terms may not be the logarithm of any probability distribution, and others depend on the data, which a prior probability distribution, by definition, cannot. MAP Bayesian inference provides a straightforward way to design complicated yet interpretable regularization terms; for example, a more complicated penalty term can be derived by using a mixture of Gaussians, rather than a single Gaussian, as the prior.
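As a sketch of the weight-decay correspondence (a toy example under stated assumptions, not the article's own code): estimating a Gaussian mean with known unit noise variance under a zero-mean Gaussian prior. The negative log-prior becomes a quadratic penalty on $\theta$, exactly the weight-decay term, and the MAP estimate shrinks the sample mean toward the prior mean of zero.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# MAP estimation of a Gaussian mean theta (known noise sigma = 1)
# with a Gaussian prior p(theta) = N(0, prior_var). Names are illustrative.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=20)   # observed samples
prior_var = 0.5

def neg_log_posterior(theta):
    nll = 0.5 * np.sum((x - theta) ** 2)          # -log p(x | theta), up to a constant
    neg_log_prior = theta**2 / (2 * prior_var)    # -log p(theta): a weight-decay penalty
    return nll + neg_log_prior

theta_map = minimize_scalar(neg_log_posterior).x

# Conjugate closed form for comparison: the sample mean shrunk toward 0,
# with shrinkage controlled by the prior precision 1/prior_var.
n = x.size
print(theta_map, n * x.mean() / (n + 1 / prior_var))
```

Setting `prior_var` large recovers (approximately) the maximum likelihood estimate, the sample mean; setting it small pulls the estimate strongly toward the prior mean, trading variance for bias exactly as described above.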