A summary of neural network training tricks
Source: Zhihu | Author: Anticoder
Link: https://zhuanlan.zhihu.com/p/59918821
This article is shared for academic exchange only; if there is any infringement, please contact us for deletion.

You have built your neural network, but training does not give good results. What then? Neural networks are supposed to be able to fit any (generally continuous) function, and with enough data (Occam's razor) a carefully designed network should reach better accuracy and generalization than other algorithms (not my claim, of course), so why can't you feel it in practice?
Intuitively, it is because neural networks can be designed almost arbitrarily: there are few prior assumptions, many parameters, even more hyperparameters, and the model has enormous freedom, which makes careful design hard for beginners. Below are some of the simplest tricks; the list is certainly not comprehensive, and comments are welcome, because I am a novice myself!
These are some parts worth paying attention to, with brief explanations of the underlying principles; the details cannot be exhaustive, so please refer to more specialized articles.
Suppose you start from a problem and decide to use a neural network. Generally speaking:
First choose the structure you want to use: one-to-one, fixed window, the granularity of the data dimensions, MLP, RNN, or CNN, etc.
Choose the nonlinearity: sigmoid, tanh, ReLU, or one of their variants. In general tanh works better than sigmoid. (Briefly: the two are very similar, and tanh is a rescaled sigmoid. Sigmoid outputs are all positive, and by the backpropagation rules the sign of the gradient for the weights of a neuron in one layer is the same as the sign of the error signal from the next layer. That is, if the next layer's error is positive, all of this layer's weight gradients are positive and the weights should all decrease; if negative, all the gradients are negative and the weights all increase. The weights either all increase or all decrease together, which is clearly problematic. tanh is symmetric around 0, which removes this systematic bias in the weight updates. Of course this is only a heuristic; it does not mean tanh is always better than sigmoid.) ReLU is also a good choice: its biggest advantage is that while tanh and sigmoid suffer from vanishing gradients when they saturate, ReLU does not, and it is cheap to compute. It can, however, produce dead neurons, discussed in more detail later.
Gradient check. If you believe the feedforward pass is correct, a gradient check (GC) can confirm that the backprop pass has no bugs. Note that if the feedforward pass is wrong but the errors roughly cancel out, GC may still look fine. Most of the time, though, GC will help you find many problems! The procedure: perturb each parameter by a small amount in both directions, recompute the loss, and compare the resulting finite-difference estimate against the analytic gradient from backprop.
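A minimal sketch of such a check in NumPy, assuming a hypothetical `loss_fn` that evaluates the loss for a given flat parameter array and an `analytic_grad` array from backprop (the names are only illustrative):

```python
import numpy as np

def gradient_check(loss_fn, params, analytic_grad, eps=1e-5, tol=1e-6):
    """Compare analytic gradients from backprop against centered finite differences."""
    params = params.astype(np.float64)  # use float64 so the finite differences are accurate
    for i in np.random.choice(params.size, size=min(20, params.size), replace=False):
        p_plus, p_minus = params.copy(), params.copy()
        p_plus.flat[i] += eps
        p_minus.flat[i] -= eps
        numeric = (loss_fn(p_plus) - loss_fn(p_minus)) / (2 * eps)
        analytic = analytic_grad.flat[i]
        rel_err = abs(numeric - analytic) / max(abs(numeric) + abs(analytic), 1e-12)
        if rel_err > tol:
            print(f"param {i}: numeric={numeric:.6g} analytic={analytic:.6g} rel_err={rel_err:.2e}")
```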

If GC fails, some part of the network, or even the whole network, may be wrong, and you don't know where. What then? Build a visualization pipeline that monitors every step, so you can tell whether each part of the network has a problem! Another trick: start with a simple task (e.g. for MNIST digit recognition, first distinguish only 0 and 1; once that works, add more digits), then test your model step by step from simple to complex and see where it breaks. For example: first feed fixed data through a single softmax layer and check the feedforward output, then the backprop; then add a single layer of hidden units and check again; then add another layer; then add biases... until the full system is built, checking every step systematically!
Parameter initialization also matters! The main considerations are the input range of your activation function and the range where its gradient is large.
Hidden biases can usually be initialized to 0. For the output-layer bias, consider the inverse activation of the mean target, or simply the mean target itself (very intuitive, right?). Weights are generally initialized to small random numbers, e.g. uniform or Gaussian.

To be safer, visualize the range of each layer's feedforward outputs and gradients, and adjust the initialization so they fall into the middle region of the activation function (where it is roughly linear). If you use ReLU, make sure the outputs are not mostly negative; you can add a small positive value or a bit of noise to the bias. Also, do not let all neurons output the same thing; the reason is obvious.
Optimization algorithm: usually mini-batch SGD; never use full-batch gradient descent (too slow). In general, for large datasets second-order batch methods such as L-BFGS work well but spend a lot of extra computation on the second-order terms; for small datasets, L-BFGS or conjugate gradient is a good choice. ("Large-batch L-BFGS extends the reach of L-BFGS", Le et al., ICML 2011.)
The main advantages of mini-batches: matrix operations can be used to speed things up in parallel; the randomness introduced helps avoid getting trapped in local optima; multiple gradients can be computed in parallel. On top of this, some refinements are very effective (because SGD really is sensitive), e.g. momentum. The idea is to add a bit of friction and a bit of acceleration to the velocity of the update: if the gradient points in the same direction over several updates, that direction is probably right, so take a slightly larger step; if the direction suddenly changes, it only perturbs the accumulated direction a little, which reduces the error caused by noisy data. When using momentum you can reduce the global learning rate somewhat.

[Figure: momentum update rule]
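The figure itself is not reproduced here, but classical momentum is usually written as

```latex
v_{t+1} = \mu\, v_t - \eta\, \nabla_\theta L(\theta_t), \qquad
\theta_{t+1} = \theta_t + v_{t+1}
```

where \mu (typically around 0.9) plays the role of the friction/inertia coefficient and \eta is the learning rate.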
Learning rate. Anyone who has trained a neural network knows its effect is quite large. Usually the lr is either fixed or decreased gradually during training:
Scheme 1: when the validation error stops decreasing, reduce lr to 0.5 of its current value.
Scheme 2: decay at a rate with theoretical convergence guarantees, O(1/t), where t is the iteration count.
Scheme 3: best of all, use an adaptive learning rate, e.g. Adagrad (Duchi et al. 2010).
Briefly, Adagrad is well suited to models where features occur with very different frequencies, such as word2vec: you certainly want rare words to receive large weight updates that move them away from common words, so that the vector space learns a meaningful distance metric, while very frequent words (the, very, often) receive small updates each time.

[Figure: Adagrad update rule]
Per the formula above, if training falls into a local optimum the parameters may stop updating; consider resetting the accumulated sum term every few epochs.
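For reference, the standard Adagrad update (a reconstruction of what the missing figure presumably showed) accumulates squared gradients per parameter:

```latex
G_t = G_{t-1} + g_t \odot g_t, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t
```

Because G_t only grows, the effective step size keeps shrinking, which is exactly why resetting the accumulated sum every few epochs can restart progress.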
Check whether your model has enough capacity to fit the data (training error vs. validation error).
If not, find a way to make it overfit! (Are you kidding?! Haha.) Generally speaking, when there are more parameters than training data, the model is able to memorize the data; first make sure the model has that capacity.
If it does overfit, you can optimize further. Most breakthroughs in deep learning come from better regularization methods; there are many ways to fight overfitting that won't all be covered here, e.g.: shrink the model (fewer layers/units); L1/L2 regularization (weight decay); early stopping (depending on dataset size, save the model every few epochs, e.g. every 5 epochs on a small dataset or every 1/3 epoch on a large one, and finally pick the model with the smallest validation error); sparsity constraints on hidden activations; dropout; data augmentation (for CNNs, pay attention to which invariances make sense); and so on. A minimal early-stopping loop is sketched below.
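A hedged sketch of such an early-stopping loop, assuming hypothetical `train_one_epoch(model)` and `evaluate(model)` helpers and a PyTorch-style `state_dict`:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate, max_epochs=100, patience=10):
    """Keep the weights with the lowest validation error; stop if it hasn't improved for `patience` epochs."""
    best_err, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_err = evaluate(model)                      # error on the validation set
        if val_err < best_err:
            best_err = val_err
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                  # validation error stopped improving
    model.load_state_dict(best_state)                  # restore the best weights seen
    return model, best_err
```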
------
That covers the general workflow. Another excellent reference:
Practical Recommendations for Gradient-Based Training of Deep Architectures Y. Bengio(2012)
https://link.springer.com/chapter/10.1007/978-3-642-35289-8_26
It additionally mentions unsupervised pre-training. In fact, when data is scarce you can also look for a related task and do transfer learning, fine-tuning, etc.
Finally, a network has so many hyperparameters; how do you choose them? The article's answer: random hyperparameter search!
Most of the above applies to supervised learning; for unsupervised learning you can do fine-tuning.
Next, a list of individual topics; additions welcome!
Standardization (Normalization)
Many machine learning models need this, so I won't go into detail here. Neural networks assume that inputs/outputs roughly follow a distribution with mean 0 and variance 1. The main reasons: treat every feature fairly, make the optimization process smoother, and remove the effect of differing units and scales.
z-score; min-max; decimal scaling; etc.
Feature scale matters: output features with a large scale produce larger errors; input features with a large scale dominate the network, making it more sensitive to them and producing large updates for them.
Features that naturally have a very small range need special care to avoid producing NaNs.
Even without standardization the network may still train, but the first few layers will likely end up doing something similar, invisibly adding to the network's complexity.
Usually all input features are standardized independently by the same rule; if the task has special requirements, specific features can be treated specially.
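A minimal z-score sketch in NumPy; the statistics are fit on the training set only and then reused for validation/test data:

```python
import numpy as np

def fit_standardizer(X_train, eps=1e-8):
    """Compute per-feature mean/std on the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps   # eps avoids division by zero for constant features
    return mean, std

def standardize(X, mean, std):
    return (X - mean) / std

# usage: mean, std = fit_standardizer(X_train); X_val = standardize(X_val, mean, std)
```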
Results check
This is a bit like having a monitoring system inside the model (it should cover preprocessing, training, and prediction). It helps you find where the model goes wrong; a visualization is best, clear at a glance, and images are especially intuitive.
Note that you need to understand what the error you defined actually means. Even if the training error is decreasing, you still need to compare it against the error the real task requires: the training error may be small, yet not small enough for the real world, which means the model has not learned enough.
Once things work during training, go and check the performance on the validation set.
Before changing the network structure, make sure every step is "monitored"; don't blindly do useless work.
Preprocessing (Pre-Processing Data)
In reality, the same data can be expressed in different ways , Like a mobile car , You look at it from different angles , It does the same thing . You should make sure that the same data is observed from the South and from the West , It should be similar !
The neural network assumes that the data distribution space is continuous
Reduce the error caused by the diversity of data representation ; Indirectly reduces unnecessary tasks in the first few layers of the network “ equivalent ” The complexity of mapping
Regularization (Regularization)
Add dropout, stochasticity, noise, data augmentation, etc. Even if you have enough data and believe overfitting is impossible, it is still better to have some regularization, e.g. a very light dropout such as dropout(0.99).
On one hand it alleviates overfitting; on the other, the randomness it introduces smooths the training process, speeds it up, and helps handle outliers.
Dropout can be seen as an ensemble with feature sampling, roughly bagging many sub-networks; during training it effectively expands the dataset with varied inputs. (In a single-layer network it acts like a compromise between naive Bayes, where all feature weights are independent, and logistic regression, where all features interact.)
In general, the larger and more complex the network, the better dropout works; it is a strong regularizer!
The best way to prevent overfitting is still lots of non-duplicated data.
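For concreteness, a small PyTorch sketch of dropout in an MLP; the layer sizes are arbitrary. The practical detail worth remembering is the `model.train()` / `model.eval()` switch, since dropout must be disabled at prediction time:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # randomly zero 50% of hidden activations during training
    nn.Linear(256, 10),
)

x = torch.randn(32, 100)
model.train()                 # dropout active: outputs are stochastic
y_train = model(x)
model.eval()                  # dropout disabled (activations rescaled): outputs are deterministic
y_eval = model(x)
```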
Batch size too large
Too large a batch size reduces the randomness of gradient descent and can hurt the model's accuracy.
If you can tolerate a longer training time, it is best to start with as small a batch size as possible (16, 8, even 1).
Large batch sizes need more epochs to reach the same level.
Reason 1: small batches help you jump out of local minima during training.
Reason 2: they let training settle into flatter local minima, which generalize better.
Learning rate (lr)
Turn off gradient clipping (it is often enabled by default), find during training the largest lr for which the model error does not explode, then train with a somewhat smaller lr (a search procedure is sketched after this list).
In general, outliers in the data produce large errors, hence large gradients and large weight updates, which makes the best lr hard to find.
Preprocess the data (remove outliers), and then lr usually does not need clipping.
If the error explodes, adding gradient clipping is only temporary relief; the root cause is still the data.
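One hedged way to implement the "find the largest lr that does not explode" idea is an lr range test; the sketch below ramps lr exponentially over a fixed number of PyTorch training steps and records the loss (the function and argument names are illustrative):

```python
import itertools
import torch

def lr_range_test(model, loss_fn, data_loader, optimizer,
                  lr_start=1e-6, lr_end=1.0, num_steps=200):
    """Exponentially increase lr each step and record the loss; pick an lr somewhat below where it blows up."""
    gamma = (lr_end / lr_start) ** (1.0 / num_steps)
    lr, history = lr_start, []
    batches = itertools.cycle(data_loader)        # reuse batches if the loader is short
    for _ in range(num_steps):
        for group in optimizer.param_groups:
            group["lr"] = lr
        x, y = next(batches)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        history.append((lr, loss.item()))
        if not torch.isfinite(loss):              # the error has exploded: stop here
            break
        lr *= gamma
    return history
```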
Activation function of the last layer
It limits the range of the output; usually the output layer needs no activation at all.
Think carefully about what the inputs are and what range the outputs take after standardization. If the outputs can be positive or negative, ReLU or sigmoid obviously won't work; multi-class tasks generally use softmax (which normalizes the outputs into a probability distribution).
The activation is only a mapping; in theory any of them could work.
But if the output produces no error signal, there is no gradient and the model learns nothing.
tanh is often used, but it has a problem: the gradient is very small near -1 and 1, neurons saturate and learn slowly, gradients tend to vanish, and the model produces many values close to -1 or 1.
Bad gradients (dead neurons)
With the ReLU activation, the gradient is 0 for inputs below zero, which can hurt performance or even stop the model from updating at all.
If the training error does not change as the epochs go by, perhaps all the neurons have "died". Try changing the activation, e.g. leaky ReLU or ELU, and see whether the training error changes (a diagnostic is sketched after this list).
With ReLU it is necessary to add a little noise to the parameters to break complete symmetry and avoid zero gradients, even adding noise to the biases.
In contrast, for sigmoid, since it is most sensitive around 0 (its gradient is largest there), initializing everything to 0 is fine.
Any operation on the gradients, such as clipping, rounding, or max/min, can cause similar problems.
Advantages of ReLU over sigmoid: one-sided suppression, a wide excitation boundary, sparse activations, and it mitigates vanishing gradients.
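One way to check for dying ReLUs is to measure, per ReLU layer, the fraction of units that are zero for an entire batch; a PyTorch sketch using forward hooks (purely a diagnostic, not part of training):

```python
import torch
import torch.nn as nn

def dead_relu_fraction(model, x):
    """Return, per ReLU layer, the fraction of units that output zero for every example in the batch."""
    stats, hooks = {}, []
    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            def hook(mod, inp, out, name=name):
                # a unit is "dead" on this batch if it is zero for every example
                stats[name] = (out == 0).all(dim=0).float().mean().item()
            hooks.append(module.register_forward_hook(hook))
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return stats
```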
Weight initialization
Generally you randomly initialize to small numbers, but it is not always that simple: some architectures require specific initialization schemes, and a bad initialization may keep you from reproducing a paper's results! Try a few popular methods and find one that works.
Too small: the signal shrinks as it propagates and the network struggles to work.
Too large: the signal grows as it propagates, diverges, and fails.
Popular choices include 'he', 'lecun', and 'Xavier' (weights with zero mean and variance 2/(number of input nodes + number of output nodes)).
Biases can usually be initialized to 0.
The initialization of every layer matters (a PyTorch sketch follows).
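As one reasonable default (not the only one), He initialization for ReLU layers with zero biases in PyTorch might look like:

```python
import torch.nn as nn

def init_weights(module):
    """Apply He (Kaiming) init to linear/conv layers, zero the biases; pass this to model.apply()."""
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# usage: model.apply(init_weights)
```

For tanh or sigmoid layers, `nn.init.xavier_uniform_` is the more common choice.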
The network is too deep
Deeper networks are said to give higher accuracy, but depth should not be piled on blindly; increase it only once a shallower network already works. Depth is added to squeeze out more accuracy; if a shallow network cannot learn anything, going deeper won't help.
Start with 3-8 layers; once that works well, deepen the network to chase higher accuracy.
An optimization method that works should already work in the shallow network; if results are bad, the problem is almost never "not deep enough".
Training and prediction slow down as the network gets deeper.
Number of hidden neurons
It is best to refer to researchers' architectures for similar tasks; typically 256-1024.
Too many: training is slow and it becomes hard to filter out noise (overfitting).
Too few: fitting capacity drops.
Think about how much information actually needs to pass through, then add a bit of slack (to account for dropout, redundant representations, and some room for estimation error).
Classification: initially try 5-10 times the number of classes.
Regression: initially try 2-3 times the number of input/output features.
Intuition matters a lot here.
In the end the effect is actually small, but it slows down training, so experiment.
Loss function
Multi-class tasks generally use cross-entropy rather than MSE.
Multi-class classification generally uses softmax, whose gradient is very small in the region below 0; adding a log (i.e. using log-softmax) improves this problem.
This also avoids the learning slowdown caused by MSE, where the learning speed is controlled by the output error (work through the gradient yourself). A sketch follows.
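In PyTorch this is mostly handled for you: `nn.CrossEntropyLoss` applies log-softmax internally and expects raw logits. A minimal comparison sketch (the MSE line is only there to show what to avoid):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 10)            # raw outputs of the last linear layer, no softmax applied
targets = torch.tensor([1, 0, 3, 9])   # class indices

# log-softmax + negative log-likelihood, with well-behaved gradients
ce = nn.CrossEntropyLoss()(logits, targets)

# MSE on one-hot targets is usually a worse choice for classification
mse = nn.MSELoss()(torch.softmax(logits, dim=1),
                   nn.functional.one_hot(targets, num_classes=10).float())
```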
Dimensionality reduction with an autoencoder (AE)
Use L1 regularization on the intermediate hidden layer; the penalty coefficient controls the sparsity of the hidden units.
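A hedged sketch of such a sparse autoencoder in PyTorch; `l1_weight` is the penalty coefficient mentioned above, and the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class SparseAE(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, in_dim)

    def forward(self, x):
        h = self.encoder(x)           # low-dimensional code
        return self.decoder(h), h

def sparse_ae_loss(x, x_hat, h, l1_weight=1e-4):
    # reconstruction error + L1 penalty on hidden activations; l1_weight controls sparsity
    return nn.functional.mse_loss(x_hat, x) + l1_weight * h.abs().mean()
```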
SGD
It is an unstable algorithm: different learning rates give very different results, so it needs careful tuning.
Generally you want the lr to start large to speed up convergence, then become small later so training settles stably into a (local) optimum.
Adaptive algorithms such as Adam, Adagrad, and Adadelta can also be used to lighten the tuning burden (the default values are usually fine).
For SGD you need to understand the learning rate, momentum, Nesterov momentum, and so on.
It is worth mentioning that many local optima of a neural network can give good results, whereas the global optimum tends to overfit.
Using CNNs
Neural networks are a feature-learning method whose power lies in the hidden layers; more connections mean an explosion of parameters, and the resulting model complexity directly causes many problems, such as severe overfitting and high computational cost.
CNNs' excellent performance makes them well worth using: the number of parameters depends only on the kernel size and the number of kernels, so you keep enough hidden units (related to the convolution stride) while the parameter count drops dramatically! Of course CNNs are mostly used for images; for other tasks, build your own abstraction and experiment!
A brief list of CNN tricks (a combined sketch follows this list):
Vary pooling or convolution sizes and strides to increase data diversity.
Data augmentation: avoid overfitting, improve generalization, add noise and perturbations.
Weight regularization.
Train SGD with learning-rate decay.
Replace the final fully connected layer with (average) pooling to reduce the parameter count.
Use max pooling instead of average pooling to avoid average pooling's blurring effect.
Two 3x3 convolutions can replace one 5x5, etc.: fewer parameters, more nonlinear mappings, stronger feature-learning ability.
3x3 and 2x2 windows.
Pre-training methods, etc.
Feed the model after data preprocessing (PCA, ZCA).
Ensemble the predictions over multiple windows of the input.
Use intermediate nodes as auxiliary output heads: this is equivalent to model fusion, adds extra back-propagated gradient signal, and provides additional regularization.
1x1 convolutions: organize information across channels, improve the network's expressiveness, reduce the output dimensionality at low cost (high cost-performance), add a nonlinear mapping, and fit the Hebbian principle.
NIN increases the network's adaptability to different scales, similar to the multi-scale idea.
Factorization into small convolutions: replace 7x7 with 1x7 and 7x1 to save parameters and add nonlinear mappings.
BN reduces the internal covariate shift problem, speeds up learning, and reduces overfitting; it may let you drop dropout, raise the learning rate, weaken other regularization, and reduce photometric distortions in data augmentation.
When the model hits the degradation problem, consider shortcut (residual) structures and increase the depth.
And so on.
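A small sketch combining several of the tricks above (stacked 3x3 convolutions, batch norm, max pooling, and global average pooling in place of a large fully connected layer); the channel counts are arbitrary:

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # two 3x3 convolutions roughly cover the receptive field of a 5x5 with fewer parameters
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.pool = nn.AdaptiveAvgPool2d(1)        # global average pooling replaces a big fully connected layer
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.classifier(x)                  # raw logits; pair with CrossEntropyLoss
```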
Using RNNs
The small details are very similar to the above; here are just a couple of additional personal observations. In fact, these RNN structures are themselves a kind of shortcut structure.
LSTM structures are usually used to prevent vanishing gradients in BPTT; GRU has fewer parameters and can be tried first.
Mind the preprocessing details: padding, sequence length settings, handling of rare words, etc.
The data volume for a typical language model has to be very large.
Gradient clipping (a sketch follows this list).
For Seq2Seq structures consider attention, provided you have a large amount of data.
CNN+gating structures also perform very well on sequence models.
For generative models, look at GAN and VAE, which generate from random variables.
Combination with RL frameworks.
With little data, use a simple MLP.
Use a hierarchical structure for prediction to reduce training complexity.
Design the sampling method to increase the model's convergence speed.
Add multiple levels of shortcut structures.
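A minimal sketch of gradient clipping for RNN/LSTM training in PyTorch; `max_norm=5.0` is just a common starting value, not a recommendation from the original article:

```python
import torch

def clipped_step(model, loss, optimizer, max_norm=5.0):
    """One optimizer step with gradient-norm clipping, as commonly used for RNN/LSTM training."""
    optimizer.zero_grad()
    loss.backward()
    # rescale all gradients if their global norm exceeds max_norm, preventing exploding updates
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
```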
That's it for today; additions are welcome.