
Focus! Master Newton's Method for Convex Optimization in Ten Minutes

2022-06-23 08:13:00 Red stone

We know that gradient descent is a first-order optimization method that uses the gradient. The Newton optimization algorithm introduced in this article uses second-order information. This article focuses on the basic concepts and derivation of Newton's method, and then compares gradient descent with Newton's method.

1

Newton's method for finding the root of an equation

Sometimes, when an equation is complicated, its roots are not easy to find by ordinary methods. Newton's method uses iteration and the idea of successive approximation to obtain an increasingly accurate root of the equation.

The core idea of Newton's method is the first-order Taylor expansion. For the equation f(x) = 0, pick any point x0 near a root and expand f(x) to first order around it:

f(x) ≈ f(x0) + f'(x0)(x - x0)

Setting f(x) = 0 and rearranging gives:

x = x0 - f(x0) / f'(x0)

Note that an approximation was used here, so the x obtained is not the exact root of the equation, only an approximate solution. However, x is closer to the root of the equation than x0. The effect is shown in the figure below:

Then solve iteratively: treat the new x as the next x0 and compute the next position, which is closer still to the root. The iteration formula can be written as:

x_{n+1} = x_n - f(x_n) / f'(x_n)

After enough iterations, the process generally converges to a root of the equation. An animated demonstration of the whole iterative convergence process is given below.
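To make the iteration concrete, here is a minimal Python sketch of the root-finding update above; the target function f(x) = x^2 - 2, the starting point, and the stopping tolerance are illustrative choices, not from the original post.

```python
def newton_root(f, df, x0, tol=1e-10, max_iter=50):
    """Find a root of f(x) = 0 via the Newton iteration x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:          # close enough to the root
            break
        x = x - fx / df(x)         # Newton update
    return x

# Example: approximate sqrt(2) as the root of f(x) = x^2 - 2
f  = lambda x: x**2 - 2
df = lambda x: 2 * x
print(newton_root(f, df, x0=1.0))  # ~1.41421356...
```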

2

Newton's method for convex optimization

The previous section showed how Newton's method finds the root of an equation. This property can be applied to the optimization of convex functions.

In machine learning and deep learning, the loss function is generally optimized with first-order gradient descent. Now look at it from another angle: minimizing the loss function is an extremum problem, and at the minimum the first derivative satisfies f'(x) = 0. In other words, if we can find the point x where f'(x) = 0, the loss function is minimized and the goal of model optimization is achieved.

The goal now is to compute the x corresponding to f'(x) = 0, that is, the root of f'(x) = 0. Optimization has thus turned into a root-finding problem, so the Newton's method from the previous section can be used.

Unlike the previous section, we first perform a second-order Taylor expansion of f(x) at x0:

f(x) ≈ f(x0) + f'(x0)(x - x0) + (1/2) f''(x0)(x - x0)^2

This approximation holds when x is close to x0. Treating it as an equality and differentiating both sides with respect to (x - x0), then setting the derivative to zero (the condition for a minimum), gives:

f'(x0) + f''(x0)(x - x0) = 0, that is, x = x0 - f'(x0) / f''(x0)

Again, this x is not the exact optimum, but it is closer than x0 to the root of f'(x) = 0. We therefore obtain the optimization iteration formula:

x_{n+1} = x_n - f'(x_n) / f''(x_n)

With this iteration formula, an approximate root of f'(x) = 0 can be found, achieving the optimization goal of minimizing the loss function.
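As a quick illustration of the second-order update, here is a minimal Python sketch that drives f'(x) to zero; the convex example f(x) = exp(x) + x^2, the starting point, and the tolerance are illustrative assumptions, not from the original post.

```python
import math

def newton_minimize(df, d2f, x0, tol=1e-10, max_iter=50):
    """Minimize a 1-D convex function by driving f'(x) to zero:
    x_{n+1} = x_n - f'(x_n) / f''(x_n)."""
    x = x0
    for _ in range(max_iter):
        g = df(x)
        if abs(g) < tol:           # first derivative ~ 0 -> minimum reached
            break
        x = x - g / d2f(x)         # second-order Newton step
    return x

# Example: f(x) = exp(x) + x^2 is convex, with f'(x) = exp(x) + 2x, f''(x) = exp(x) + 2
df  = lambda x: math.exp(x) + 2 * x
d2f = lambda x: math.exp(x) + 2
print(newton_minimize(df, d2f, x0=0.0))  # ~-0.3517, where f'(x) = 0
```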

3

Gradient descent vs. Newton's method

Now write down the update formulas of gradient descent and Newton's method:

Gradient descent: x_{n+1} = x_n - η f'(x_n), where η is the learning rate
Newton's method: x_{n+1} = x_n - f'(x_n) / f''(x_n)

Gradient descent approximates the function at position x_n with a first-order function, a straight line; it computes the gradient, and the next optimization direction is the negative gradient direction. Newton's method approximates the function at x_n with a second-order function, a quadratic curve; it uses both the gradient and the second derivative to determine the next optimization direction. Schematic diagrams of first-order and second-order optimization are shown below:

Gradient descent: first-order optimization

Newton's method: second-order optimization

The above describes the differences between gradient descent and Newton's method. So which one optimizes better?

First, consider the advantages of Newton's method. First, the Newton update formula contains no learning-rate parameter, so there is no need to select an appropriate learning rate via cross-validation. Second, Newton's method uses the curvature information of the curve itself, so it tends to converge in fewer iterations than gradient descent. The figure below shows an example of minimizing an objective function: the red curve is the iterative solution by Newton's method, and the green curve by gradient descent. Clearly, Newton's method optimizes faster.
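To see the iteration-count difference concretely, here is a small Python sketch that runs both updates on the same one-dimensional convex objective; the objective, starting point, and learning rate are illustrative choices, not the ones behind the figure above.

```python
import math

# Compare iteration counts of gradient descent and Newton's method on the
# convex objective f(x) = exp(x) + x^2 (illustrative choice).
df  = lambda x: math.exp(x) + 2 * x        # f'(x)
d2f = lambda x: math.exp(x) + 2            # f''(x)

def run(update, x0=2.0, tol=1e-8, max_iter=10_000):
    """Apply an update rule until |f'(x)| < tol; return the result and iteration count."""
    x, n = x0, 0
    while abs(df(x)) > tol and n < max_iter:
        x = update(x)
        n += 1
    return x, n

lr = 0.1  # learning rate for gradient descent, chosen by hand here
x_gd, n_gd = run(lambda x: x - lr * df(x))       # first-order update
x_nt, n_nt = run(lambda x: x - df(x) / d2f(x))   # second-order update

print(f"gradient descent: x = {x_gd:.6f} after {n_gd} iterations")
print(f"Newton's method : x = {x_nt:.6f} after {n_nt} iterations")
```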

Now consider the shortcomings of Newton's method. Notice that the Newton iteration requires not only the first derivative but also the second derivative. In the multivariate case, the first and second derivatives correspond to the Jacobian matrix and the Hessian matrix, respectively.

Jacobian matrix: for a vector-valued function f: R^n → R^m, J is the m×n matrix of first-order partial derivatives, with entries J_ij = ∂f_i/∂x_j.

Hessian matrix: for a scalar function f: R^n → R, H is the n×n matrix of second-order partial derivatives, with entries H_ij = ∂²f/(∂x_i ∂x_j).

Newton's method needs not only the Hessian matrix but also its inverse. When the amount of data is small, computation speed is not greatly affected. However, with large amounts of data, and especially in deep neural networks, computing the Hessian and its inverse is very time-consuming; in such settings Newton's method ends up slower overall than gradient descent. This is why most optimization strategies for neural network loss functions are currently based on gradient descent.
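To show where the Hessian enters in the multivariate case, here is a minimal NumPy sketch of a single Newton step on an illustrative quadratic objective (the matrix A and vector b are made-up values); in practice one solves the linear system H·step = g rather than forming the inverse explicitly.

```python
import numpy as np

# One Newton step for a multivariate objective: x_{n+1} = x_n - H(x_n)^{-1} g(x_n).
# Illustrative quadratic f(x) = 0.5 * x^T A x - b^T x, with gradient A x - b
# and constant Hessian A, so a single Newton step reaches the minimum.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # symmetric positive definite -> convex objective
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b                # gradient of f

def hess(x):
    return A                        # Hessian of f

x = np.zeros(2)
step = np.linalg.solve(hess(x), grad(x))   # solve H * step = g (no explicit inverse)
x = x - step
print(x, grad(x))                   # gradient is ~0: minimum reached in one step
```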

It is worth mentioning that there are improved algorithms that address these shortcomings of Newton's method. They are called quasi-Newton methods; representative examples are BFGS and L-BFGS.

The BFGS algorithm computes an approximation to the inverse of the Hessian matrix, which effectively improves speed. However, the entire approximate inverse Hessian still has to be stored, so the space cost is high.

The L-BFGS algorithm improves on BFGS: it does not need to store the approximate inverse Hessian; instead, the current search direction is obtained directly through an iterative procedure, greatly reducing the space cost.
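In practice there is usually no need to implement quasi-Newton methods by hand; SciPy's optimizer, for example, exposes both. A minimal usage sketch, where the Rosenbrock objective and the starting point are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize the Rosenbrock function with the two quasi-Newton methods mentioned above.
def rosen(x):
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

x0 = np.array([-1.2, 1.0])

res_bfgs  = minimize(rosen, x0, method="BFGS")      # maintains an approximate inverse Hessian
res_lbfgs = minimize(rosen, x0, method="L-BFGS-B")  # limited-memory variant

print("BFGS    :", res_bfgs.x,  res_bfgs.nit,  "iterations")
print("L-BFGS-B:", res_lbfgs.x, res_lbfgs.nit, "iterations")
```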

In general, optimization algorithms based on gradient descent, such as RMSprop and Adam, are more widely used in practice. However, improved variants of Newton's method such as BFGS and L-BFGS have their own strengths and remain highly practical.


Copyright notice
This article was written by [Red stone]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/01/202201121447326170.html