CV fully connected neural network
2022-06-23 18:57:00 【Bachuan Xiaoxiaosheng】
Fully connected neural networks
Cascades multiple linear transforms with nonlinear operations in between.
For example, a two-layer fully connected network:
$f = w_{2}\max(0, w_{1}x + b_{1}) + b_{2}$
The nonlinear operation is essential; it is what distinguishes the network from a linear classifier and lets it handle linearly non-separable cases.
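A minimal NumPy sketch of the two-layer network above (the layer sizes and random inputs are illustrative, not from the notes):

```python
import numpy as np

def two_layer_fc(x, w1, b1, w2, b2):
    # f = w2 * max(0, w1 x + b1) + b2
    h = np.maximum(0, w1 @ x + b1)   # hidden layer with ReLU nonlinearity
    return w2 @ h + b2               # output scores, no activation

# Illustrative sizes: 4-dim input, 8 hidden units, 3 output classes
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
w1, b1 = rng.standard_normal((8, 4)), np.zeros(8)
w2, b2 = rng.standard_normal((3, 8)), np.zeros(3)
print(two_layer_fc(x, w1, b1, w2, b2))
```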
Structure
- Input layer
- Hidden layers
- Output layer
- Weights
The number of layers of a network is counted excluding the input layer; e.g., one hidden layer plus the output layer makes a two-layer network.
Activation function
Removing the nonlinear operation (the activation function) makes the neural network degenerate into a linear classifier.
Commonly used
- sigmoid: $\frac{1}{1+e^{-x}}$, compresses values to the range 0~1
- ReLU: $\max(0, x)$
- tanh: $\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$, compresses values to the range -1~1
- Leaky ReLU: $\max(0.1x, x)$
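A minimal NumPy sketch of these four activation functions (the 0.1 slope in Leaky ReLU follows the formula above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes values into (0, 1)

def relu(x):
    return np.maximum(0, x)            # zero for negative inputs, identity otherwise

def tanh(x):
    return np.tanh(x)                  # squashes values into (-1, 1)

def leaky_relu(x, alpha=0.1):
    return np.maximum(alpha * x, x)    # small slope for negative inputs
```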
Structure design
- Width
- Depth
The more neurons, the stronger the nonlinearity and the more complex the decision boundary. But more is not always better, and neither is a more complicated boundary; choose the width and depth according to the difficulty of the task.
softmax
Exponentiate first, then normalize: $\mathrm{softmax}(x)_{i} = \frac{e^{x_{i}}}{\sum_{j} e^{x_{j}}}$
This transforms the output scores into a probability distribution.
Cross entropy
A measure of the gap between the predicted distribution and the true distribution is needed.
The true distribution is generally in one-hot form.
$H(p,q) = -\sum_{x} p(x)\log q(x)$
$H(p,q) = KL(p\|q) + H(p)$
Since the ground truth is generally one-hot, $H(p)$ is usually 0,
so the KL divergence is sometimes used instead.
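A minimal NumPy sketch combining softmax and cross-entropy; the scores and the one-hot label are illustrative, and the small eps guards against log(0):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    return -np.sum(p * np.log(q + eps))   # H(p, q) = -sum_x p(x) log q(x)

scores = np.array([3.0, 1.0, -2.0])   # raw classifier outputs
q = softmax(scores)                   # predicted distribution
p = np.array([1.0, 0.0, 0.0])         # one-hot ground truth
print(q, cross_entropy(p, q))
```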
Computational graph
A directed graph that expresses the computational relationships among inputs, outputs, and intermediate variables; each node corresponds to an operation.
With it, a computer can apply the chain rule to compute the gradient at every position of an arbitrarily complex function.
Computational graphs involve a choice of granularity (how coarse or fine each node's operation is).
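As an illustration (not part of the original notes) of how the chain rule runs backward through such a graph, here is a hand-written backward pass for f = sigmoid(w*x + b):

```python
import numpy as np

# Forward pass through a tiny graph: z = w*x + b, f = sigmoid(z)
w, x, b = 2.0, -1.0, 0.5
z = w * x + b
f = 1.0 / (1.0 + np.exp(-z))

# Backward pass: multiply local derivatives node by node (chain rule)
df_dz = f * (1.0 - f)   # derivative of sigmoid at z
df_dw = df_dz * x       # dz/dw = x
df_dx = df_dz * w       # dz/dx = w
df_db = df_dz * 1.0     # dz/db = 1
print(df_dw, df_dx, df_db)
```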
Vanishing and exploding gradients
The derivative of sigmoid is very small over a wide range of inputs; combined with the repeated multiplications of the chain rule, the gradient flowing back approaches 0. This is the vanishing gradient problem.
Gradients can also explode, which can be handled with gradient clipping.
The (Leaky) ReLU function has derivative 1 whenever its input is greater than 0, so it does not cause gradients to vanish or explode; gradient flow is smoother and convergence is faster.
Improved gradient algorithms
- Gradient descent: every update must be computed over all samples, which takes too long
- Stochastic gradient descent: single-sample updates are noisy and inefficient
- Mini-batch gradient descent: a compromise between the two
Problems remain:
In a valley-shaped loss surface, the update oscillates along one direction and descends slowly along the other.
Much of the work in the oscillating direction is wasted,
and adjusting the step size alone cannot fix this.
Solution
Momentum method
Accumulate the history of gradients as a velocity.
Oscillating components cancel each other out, while movement along the flat descent direction accelerates.
It can also help escape local minima and saddle points.
Pseudocode
Initialize the velocity v = 0
Loop:
---- Compute the gradient $g$
---- Update the velocity $v = \mu v + g$
---- Update the weights $w = w - \varepsilon v$
$\mu$ takes values in $[0, 1)$, commonly 0.9
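A minimal NumPy sketch of the momentum update; grad_fn is a placeholder for whatever computes the gradient of the loss:

```python
import numpy as np

def sgd_momentum(w, grad_fn, lr=0.01, mu=0.9, steps=100):
    v = np.zeros_like(w)          # velocity, accumulated over history
    for _ in range(steps):
        g = grad_fn(w)            # gradient of the loss at the current weights
        v = mu * v + g            # oscillating components cancel in v
        w = w - lr * v            # step along the accumulated direction
    return w
```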
Adaptive gradient method
Reduce the step size along the oscillating direction and increase it along the flat direction.
Directions with a larger accumulated squared gradient magnitude are the oscillating ones; those with a smaller magnitude are the flat ones.
Pseudocode
Initialize the accumulator r = 0
Loop:
---- Compute the gradient $g$
---- Accumulate the squared gradient $r = r + g * g$
---- Update the weights $w = w - \frac{\varepsilon}{\sqrt{r}+\delta}\, g$
$\delta$ prevents division by zero, usually $10^{-5}$
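A corresponding NumPy sketch of this adaptive update (grad_fn again stands in for the user's gradient computation):

```python
import numpy as np

def adagrad(w, grad_fn, lr=0.01, delta=1e-5, steps=100):
    r = np.zeros_like(w)                       # accumulated squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        r = r + g * g                          # grows fastest along oscillating directions
        w = w - lr / (np.sqrt(r) + delta) * g  # per-parameter step sizes
    return w
```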
Its defect: because the accumulation grows without bound over time, progress along the flat direction is eventually suppressed as well.
RMSProp
The fix is to blend in historical information with exponential decay:
replace the accumulated squared gradient with $r = \rho r + (1-\rho)\, g * g$
$\rho$ takes values in $[0, 1)$, usually 0.999
Adam
Combines momentum and adaptive step sizes.
Pseudocode
Initialize the accumulators r = 0, v = 0
Loop over steps t = 1, 2, ...:
---- Compute the gradient $g$
---- Accumulate the gradient $v = \mu v + (1-\mu) g$
---- Accumulate the squared gradient $r = \rho r + (1-\rho)\, g * g$
---- Correct the bias $\hat{v} = \frac{v}{1-\mu^{t}},\ \hat{r} = \frac{r}{1-\rho^{t}}$
---- Update the weights $w = w - \frac{\varepsilon}{\sqrt{\hat{r}}+\delta}\hat{v}$
Suggested values: decay rate $\rho = 0.999$, momentum coefficient $\mu = 0.9$
The bias correction alleviates the cold-start problem at the beginning of training.
With Adam there is usually no need to tune these hyperparameters by hand.
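A NumPy sketch of the Adam update above, again with grad_fn as a placeholder; the default step size 0.001 is a common choice, not taken from the notes:

```python
import numpy as np

def adam(w, grad_fn, lr=0.001, mu=0.9, rho=0.999, delta=1e-5, steps=100):
    v = np.zeros_like(w)                     # first moment (momentum)
    r = np.zeros_like(w)                     # second moment (squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        v = mu * v + (1 - mu) * g
        r = rho * r + (1 - rho) * g * g
        v_hat = v / (1 - mu ** t)            # bias correction against the cold start
        r_hat = r / (1 - rho ** t)
        w = w - lr * v_hat / (np.sqrt(r_hat) + delta)
    return w
```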
Weight initialization
All-zero initialization
All neurons produce the same output and receive identical parameter updates, so the network cannot train.
Random weight initialization
Weights are drawn from a Gaussian distribution.
But as activations propagate through the layers they either saturate or shrink toward zero (depending on the scale of the Gaussian), and the network cannot train.
Xavier initialization
The goal is to keep the variance of the activations and of the local gradients as consistent as possible across layers during propagation, by choosing a distribution for $w$ that makes the input and output variances equal.
When $\mathrm{var}(w) = 1/N$, where $N$ is the number of inputs to the neuron, the input and output variances match.
He initialization
Suited to ReLU; weights are sampled from $\mathcal{N}(0, 2/N)$.
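A minimal NumPy sketch of the two initializations (the layer sizes in the example are illustrative):

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    # var(w) = 1/n_in keeps input and output variances matched
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))

def he_init(n_in, n_out, rng):
    # var(w) = 2/n_in, suited to ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

rng = np.random.default_rng(0)
print(he_init(4, 8, rng).std())   # illustrative layer sizes
```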
Batch normalization
Normalize the neuron outputs directly over each mini-batch.
Usually placed after the fully connected layer and before the nonlinearity.
It keeps outputs away from the small-gradient (saturated) regions and thus helps with the vanishing gradient problem.
The scale and shift applied after normalization (an effective mean and variance) can be learned by the network.
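A minimal NumPy sketch of training-time batch normalization over a mini-batch (gamma and beta are the learnable scale and shift):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x has shape (batch, features); normalize each feature over the mini-batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learnable scale and shift
```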
Underfitting
The model's descriptive capacity is too weak; the model is too simple.
Overfitting
Good performance on the training set but poor performance in real scenarios.
The model merely memorizes the training samples instead of extracting generalizable features.
The fundamental tension in machine learning
- Optimization: achieve the best possible performance on the training set
- Generalization: achieve good performance on unseen data
Early in training: optimization and generalization improve together
Later in training: generalization degrades and overfitting sets in
Remedies
- Best option: obtain enough data
- Suboptimal option: regulate how much information the model is allowed to store
  - Adjust the model size
  - Constrain the model weights with regularization terms
Random deactivation (Dropout)
Give hidden neurons a chance of being deactivated: during training, Dropout randomly sets some neurons' outputs to 0.
- Reduces model capacity
- Encourages the weights to spread out, acting as a regularizer
- Can be viewed as an ensemble of models
Remaining problem
Random deactivation is not applied at test time, so the output scale would differ from training.
Solution
Scale the outputs of the surviving neurons by a corrective factor during training (inverted dropout), so the expected output matches the test-time behavior.
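A minimal NumPy sketch of this inverted-dropout scheme (the drop probability 0.5 and the activations are illustrative):

```python
import numpy as np

def dropout_train(x, p, rng):
    # Zero activations with probability p, rescale survivors by 1/(1-p)
    # so the expected output matches the unchanged test-time forward pass.
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

rng = np.random.default_rng(0)
h = np.ones((2, 6))               # illustrative hidden-layer activations
print(dropout_train(h, p=0.5, rng=rng))
```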
Hyperparameters
Learning rate
- Too large: fails to converge
- Somewhat large: oscillates and cannot reach the optimum
- Too small: convergence takes too long
- Moderate: converges quickly with good results
Hyperparameter search methods
- Grid search
- Random search: covers more hyperparameter combinations, the usual choice
Search coarsely first, then refine.
The search is generally carried out in log space.
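A small sketch of random search over the learning rate in log space; the bounds 1e-5 and 1e-1 and the 20 samples are illustrative choices, and each candidate would then be fed to the user's own training/validation loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample 20 learning-rate candidates uniformly in log space
candidate_lrs = 10 ** rng.uniform(np.log10(1e-5), np.log10(1e-1), size=20)
print(np.sort(candidate_lrs))
```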