Focus! Ten Minutes to Master Newton's Method for Convex Optimization
2022-06-23 08:13:00 【Red stone】
We know that gradient descent is a first-order optimization method that uses only the gradient. Newton's method, introduced in this article, uses second-order information. This article focuses on the basic concepts and derivation of Newton's method, and then compares it with gradient descent.
1
Newton's Method for Finding the Root of an Equation
Sometimes, for a complicated equation, it is not easy to find the root by ordinary methods. Newton's method uses iteration and the idea of successive approximation to obtain an increasingly accurate approximation of the root.
The core idea of Newton's method is the first-order Taylor expansion. For the equation f(x) = 0, expand f(x) to first order around an arbitrary point x0:

$$f(x) \approx f(x_0) + f'(x_0)(x - x_0)$$
Setting f(x) = 0 and solving for x gives:

$$x = x_0 - \frac{f(x_0)}{f'(x_0)}$$
Note that because an approximation was used, the x obtained here is not the exact root of the equation, only an approximate solution. However, x is closer to the root than x0. The effect is illustrated in the figure below:
Then solve iteratively: treat this x as the new x0 and find the next position, which is closer still to the root. The iterative formula can be written as:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
After enough iterations, the sequence generally converges to a root of the equation. An animated demonstration of the whole iterative convergence process is shown below.
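In code, the same iteration can be sketched as follows. This is a minimal illustration; the names (`newton_root`, `f`, `df`) and the example equation are my own choices, not code from the original post.

```python
def newton_root(f, df, x0, tol=1e-8, max_iter=50):
    """Find an approximate root of f(x) = 0 via the update x <- x - f(x)/f'(x)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:      # already close enough to a root
            break
        x = x - fx / df(x)     # Newton update from the first-order Taylor expansion
    return x

# Example: solve x^3 - 2 = 0 starting from x0 = 1.5
print(newton_root(lambda x: x**3 - 2, lambda x: 3 * x**2, 1.5))  # ~1.2599, the cube root of 2
```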
2
Newton's Method for Convex Optimization
The previous section showed how Newton's method finds the root of an equation. The same property can be applied to optimizing a convex function.
In machine learning and deep learning, the loss function is usually minimized with gradient descent, which is based on the first derivative. Now look at it from another angle: minimizing the loss function is an extremum problem, and at the extremum the first derivative satisfies f'(x) = 0. In other words, if we can find the point x where f'(x) = 0, the loss function is minimized and the goal of model optimization is achieved.
The goal is now to find the x that satisfies f'(x) = 0, i.e., the root of f'(x) = 0. This turns the problem into root finding, so we can apply Newton's method from the previous section.
Unlike the previous section, we now expand f(x) around x0 to second order:

$$f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2$$
Differentiating the right-hand side with respect to (x - x0) and setting the result to zero (the condition for a minimum of the quadratic approximation) gives:

$$f'(x_0) + f''(x_0)(x - x_0) = 0 \;\Longrightarrow\; x = x_0 - \frac{f'(x_0)}{f''(x_0)}$$
Again, although this x is not yet the optimal point, it is closer than x0 to the root of f'(x) = 0. This gives the iteration formula for the optimization:

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$$
By applying this iterative formula repeatedly, we keep approaching the root of f'(x) = 0 and thereby achieve the objective of minimizing the loss function.
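Here is a minimal Python sketch of this second-order iteration, again with illustrative names and an example function of my own choosing (not from the original post):

```python
def newton_minimize(df, d2f, x0, tol=1e-8, max_iter=50):
    """Minimize f by driving f'(x) to zero via the update x <- x - f'(x)/f''(x)."""
    x = x0
    for _ in range(max_iter):
        g = df(x)
        if abs(g) < tol:       # f'(x) is (almost) zero: a stationary point
            break
        x = x - g / d2f(x)     # second-order Newton step
    return x

# Example: minimize f(x) = (x - 3)^2 + 1, whose minimum is at x = 3
print(newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0))  # 3.0
```

Because the example function is exactly quadratic, its second-order Taylor expansion is exact and the iteration lands on the minimizer in a single step.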
3
Gradient Descent vs. Newton's Method
Now write down the update formulas of gradient descent and Newton's method:

Gradient descent: $x_{n+1} = x_n - \eta\, f'(x_n)$, where $\eta$ is the learning rate

Newton's method: $x_{n+1} = x_n - \dfrac{f'(x_n)}{f''(x_n)}$
Gradient descent approximates the function at the position x_n with a first-order function, i.e., a straight line: it computes the gradient, and the next optimization step moves in the direction opposite to the gradient. Newton's method approximates the function at x_n with a second-order function, i.e., a quadratic curve: it computes both the gradient and the second derivative and uses them to determine the next step. Schematic diagrams of first-order and second-order optimization are shown below:
Gradient descent: first-order optimization
Newton's method: second-order optimization
That is the difference between gradient descent and Newton's method. So which one optimizes better?
First, the advantages of Newton's method. For one, its update formula contains no learning-rate parameter, so there is no need to pick an appropriate learning rate by cross-validation. For another, Newton's method exploits the curvature information of the function itself, so it generally converges in fewer iterations than gradient descent. The figure below shows an example of minimizing an objective function: the red curve is the iterative solution by Newton's method, and the green curve is the solution by gradient descent. Clearly, Newton's method converges faster.
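To make the "fewer iterations" claim concrete, here is a small hypothetical comparison on a one-dimensional objective. The function, the starting point, and the learning rate are my own illustrative choices; this is not the example plotted in the original figure.

```python
# Hypothetical objective f(x) = x^4 - 3x^3 + 2, whose minimizer is x = 9/4 = 2.25
df  = lambda x: 4 * x**3 - 9 * x**2      # first derivative
d2f = lambda x: 12 * x**2 - 18 * x       # second derivative

x_gd, x_newton, eta = 4.0, 4.0, 0.01     # same starting point; eta is a hand-picked learning rate
for _ in range(20):
    x_gd     = x_gd - eta * df(x_gd)                    # first-order (gradient descent) update
    x_newton = x_newton - df(x_newton) / d2f(x_newton)  # second-order (Newton) update
print(x_gd, x_newton)  # Newton is already at ~2.25; gradient descent is still approaching it
```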
Next, the shortcomings of Newton's method. Notice that its iterative formula requires not only the first derivative but also the second derivative. In the multivariate case, the first and second derivatives correspond to the Jacobian matrix and the Hessian matrix, respectively.
Jacobian matrix (of a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$):

$$J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

Hessian matrix (of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$):

$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
Newton's method needs not only the Hessian matrix but also its inverse. When the amount of data is small, the computation speed is not greatly affected. But when the amount of data is large, and especially in deep neural networks, computing the Hessian and its inverse is very time-consuming. In terms of overall cost, Newton's optimization is then often slower than gradient descent. This is why most optimization strategies for neural-network loss functions are currently based on gradient descent.
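As a sketch of what a single multivariate Newton step looks like, assuming the gradient and Hessian can be formed explicitly (realistic only for small models), one can solve the linear system H d = g rather than inverting H. The function names below are my own.

```python
import numpy as np

def newton_step(x, grad_fn, hess_fn):
    """One multivariate Newton step: x <- x - H^{-1} g, computed via a linear solve."""
    g = grad_fn(x)                    # gradient vector
    H = hess_fn(x)                    # n x n Hessian matrix
    return x - np.linalg.solve(H, g)  # solving H d = g avoids forming H^{-1} explicitly

# Example: quadratic loss f(x) = 0.5 x^T A x - b^T x, minimized in a single Newton step
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = newton_step(np.zeros(2), lambda x: A @ x - b, lambda x: A)
print(np.allclose(A @ x_star, b))  # True: the step lands exactly on the minimizer
```

For a model with millions of parameters, the Hessian has (millions)² entries, which is exactly why this direct approach is impractical for deep networks.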
It is worth mentioning that, in view of these shortcomings, several improved algorithms have been developed. They are collectively called quasi-Newton methods; the most representative are BFGS and L-BFGS.
BFGS approximates the inverse of the Hessian instead of computing it exactly, which effectively improves the running speed. However, the entire approximate inverse Hessian still has to be stored, so the memory cost is high.
L-BFGS improves on BFGS: it does not store the approximate inverse Hessian at all, but instead obtains the current search direction directly from an iterative procedure over a short history of updates, which greatly reduces the memory cost.
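For reference, here is a minimal sketch of how BFGS and L-BFGS are commonly called in practice through SciPy; the Rosenbrock-style objective is my own toy example, not one from the original post.

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective (Rosenbrock function); both solvers build up approximate
# second-order information instead of computing the exact Hessian or its inverse.
def loss(w):
    return (1 - w[0])**2 + 100 * (w[1] - w[0]**2)**2

w0 = np.zeros(2)
res_bfgs  = minimize(loss, w0, method="BFGS")
res_lbfgs = minimize(loss, w0, method="L-BFGS-B")
print(res_bfgs.x, res_lbfgs.x)  # both converge near the true minimizer [1, 1]
```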
In general, gradient-descent-based optimizers such as RMSprop and Adam are more widely used in practice. However, the improved Newton-type algorithms such as BFGS and L-BFGS have their own strengths and remain highly practical.