
A Summary of Neural Network Training Tricks

2022-06-22 10:01:00 Xiaobai learns vision


Source | Zhihu    Author | Anticoder

Link | https://zhuanlan.zhihu.com/p/59918821

This article is shared for academic exchange only; if there is any infringement, please contact us for removal.


You have built your neural network carefully, so why doesn't training produce good results? It is supposed to fit (almost) any continuous function, and with enough data (Occam's razor notwithstanding), a carefully designed network should reach better accuracy and generalization than other algorithms (not my claim, of course). So why can't you see that in practice?


Intuitively, it is because neural networks can be designed almost arbitrarily: there are few prior assumptions, many parameters, and even more hyperparameters, so the model has a very high degree of freedom, and careful design is hard for beginners. Below are some of the simplest tricks. The list is definitely not exhaustive, and comments are welcome, because I'm a rookie myself!


These are the parts worth paying attention to, with brief explanations of the underlying principles. The details cannot be covered exhaustively here; please refer to more specialized articles.

Suppose you start from a problem and decide to use a neural network. Generally speaking:

  • First choose the structure you want to use: one-to-one, fixed window, data dimension/granularity, MLP, RNN, or CNN, etc.

  • Choose the nonlinearity: sigmoid, tanh, ReLU, or some variant. Generally tanh works better than sigmoid. (Briefly: the two are very similar, and tanh is a rescaled sigmoid. The sigmoid's outputs are all positive, and by the backprop rules the sign of the gradient of a layer's weights is the same as the sign of the error from the next layer. So if the next layer's error is positive, all of this layer's weights decrease; if it is negative, the gradients for this layer are all negative and the weights all increase. The weights either all increase or all decrease together, which is clearly problematic. tanh is symmetric around 0, which removes this systematic bias in the weight updates. Of course this is only a heuristic; it does not mean tanh is always better than sigmoid.) ReLU is also a good choice: its biggest advantage is that while tanh and sigmoid suffer from vanishing gradients when saturated, ReLU does not, and it is cheap to compute. It can, however, produce dead neurons, discussed in more detail below.

  • Gradient check. If you believe the feedforward pass is fine, a gradient check can confirm that the backprop pass has no bugs. Worth noting: if the feedforward pass has a problem but the errors are roughly consistent, the gradient check may still appear to pass. Most of the time, though, a gradient check will catch many problems. The steps are as follows (and a minimal numerical sketch follows the figure):

[Figure: gradient-check procedure]
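A minimal numerical gradient check, assuming you can evaluate the loss f(w) and the analytic gradient grad(w) on a flat parameter vector; the toy quadratic at the end is only for illustration.

```python
import numpy as np

def gradient_check(f, grad, w, eps=1e-5, tol=1e-5):
    analytic = grad(w)
    numeric = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        # centered difference: (f(w+eps) - f(w-eps)) / (2*eps)
        numeric[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    # relative error guards against scale differences between components
    rel_err = np.abs(analytic - numeric) / np.maximum(1e-8, np.abs(analytic) + np.abs(numeric))
    return rel_err.max() < tol, rel_err.max()

# toy example: f(w) = ||w||^2, whose gradient is 2w
ok, err = gradient_check(lambda w: np.sum(w ** 2), lambda w: 2 * w, np.random.randn(10))
print(ok, err)
```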

If the gradient check fails, some part of the network, or possibly the whole network, is wrong, and you don't know where. What then? Build a visualization/monitoring process for every component, so you know whether each part of the network has a problem. Another trick: start with a simple task (for MNIST digit recognition, first distinguish only 0 and 1; once that works, add more digits), then test your model step by step from simple to complex to see where it breaks. For example: first push fixed data through a single softmax layer and check the feedforward output, then the backprop; then add a single layer of hidden units and check again; then add another layer; then add biases... and so on until the full model is built, systematically checking every step!

  • Parameter initialization also matters! The main consideration is the operating range of your activation function and the range where its gradient is large. Hidden biases can usually be initialized to 0. For the output-layer bias, consider the inverse activation of the mean targets, or the mean targets themselves (very intuitive, right?). Weights are generally initialized to small random numbers, e.g. uniform or Gaussian.


To be safer, visualize the range of each layer's feedforward outputs and gradients, and adjust the initialization so that they fall in the middle region of the activation function (where it is roughly linear). If you use ReLU, make sure the outputs are not mostly negative; you can give the biases a small positive offset or a bit of noise. Also make sure the neurons do not all produce the same output, for obvious reasons.

  • Optimization algorithm: mini-batch SGD is the usual choice; never use full-batch gradient descent (too slow). In general, for large datasets second-order batch methods such as L-BFGS can work well, but the second-order terms involve a lot of extra computation; for small datasets, L-BFGS or conjugate gradient is a good fit. (Large-batch L-BFGS extends the reach of L-BFGS; Le et al., ICML 2011.)

The main advantages of mini-batches: matrix operations can be parallelized for speed; the injected randomness helps escape local optima; multiple gradients can be computed in parallel; and so on. Several refinements on top of this are very effective (because plain SGD really is sensitive). Momentum, for example, adds a bit of "friction" and "acceleration" to the update: if the gradient points in the same direction over several updates, that direction is probably right, so the step in it grows; if the direction suddenly changes, it only perturbs the accumulated direction slightly, which damps the noise coming from individual batches. When using momentum, you can reduce the global learning rate somewhat. A minimal sketch follows the figure.

[Figure: momentum update]
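A minimal NumPy sketch of the classical momentum update described above; the learning rate, momentum coefficient, and random gradients are illustrative placeholders.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, mu=0.9):
    # the velocity accumulates past gradients: repeated agreement in direction
    # builds up speed, while a sudden change only nudges the existing direction
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

w = np.zeros(5)
v = np.zeros_like(w)
for _ in range(3):
    g = np.random.randn(5)              # stand-in for a mini-batch gradient
    w, v = sgd_momentum_step(w, g, v)
```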

  • Learning rate. Anyone who has trained a neural network knows its effect is considerable. Generally you either keep lr fixed or let it decrease gradually over training. Three common schemes, with a small sketch after them:

Scheme 1: when the validation error stops decreasing, reduce lr to 0.5 of its current value.

Scheme 2: a decay schedule with a theoretical convergence guarantee, O(1/t), where t is the iteration count.

Scheme 3: better still, use an adaptive learning rate, e.g. Adagrad (Duchi et al., 2010), etc.
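A sketch of the three schemes, assuming a PyTorch model and optimizer; the layer sizes and hyperparameter values are placeholders.

```python
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Scheme 1: halve the learning rate when validation error stops improving
plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=3)
# plateau.step(val_error)   # call once per epoch with the validation error

# Scheme 2: O(1/t) decay, with t the iteration count
lr0 = 0.1
def lr_at(t):
    return lr0 / (1.0 + t)

# Scheme 3: an adaptive method such as Adagrad
adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
```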

Briefly, Adagrad is well suited to models where features occur with very different frequencies, e.g. word2vec: you want rare words to receive very large weight updates so they move away from the common words and the distances in the embedding space become meaningful, while very frequent words (the, very, often) get only small updates each time.

[Figure: Adagrad update rule]

From the formula above, once the accumulated term is large (for example after getting stuck around a local optimum), the parameters may barely update any more; consider resetting the accumulated sum term every few epochs. A minimal sketch:
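A minimal NumPy sketch of the Adagrad rule in the figure above: each parameter gets its own effective learning rate, shrinking with the accumulated squared gradients. Occasionally resetting `cache`, as suggested above, restores the step size if updates have stalled.

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    cache += grad ** 2                        # per-parameter sum of squared gradients
    w -= lr * grad / (np.sqrt(cache) + eps)   # rarely-updated parameters keep large steps
    return w, cache

w, cache = np.zeros(3), np.zeros(3)
for _ in range(100):
    g = np.random.randn(3)                    # stand-in gradient
    w, cache = adagrad_step(w, g, cache)
```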

  • Check whether your model has enough capacity to fit the data (training error vs. validation error).

If it does not, find a way to make it overfit first! (Are you kidding?! Haha.) Generally speaking, when there are more parameters than training data, the model is able to memorize the data; secure the model's capacity first.

If it can overfit, then you can optimize further. Most breakthroughs in deep learning have come from better regularization methods, and there are many ways to fight overfitting that are not discussed in depth here: shrink the model (layers, units); L1/L2 regularization (weight decay); early stopping (depending on the dataset size, save the model every so many epochs, e.g. every 5 epochs for small datasets and every 1/3 epoch for large ones, and finally pick the model with the smallest validation error — a small sketch follows); sparsity constraints on hidden activations; dropout; data augmentation (for CNNs, pay attention to which invariances the transformations preserve); and so on.
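A minimal early-stopping skeleton. `train_one_epoch`, `evaluate`, `model`, and the loaders are hypothetical placeholders standing in for your own training code (with state_dict()/load_state_dict() as in PyTorch).

```python
import copy

def fit_with_early_stopping(model, train_loader, val_loader, max_epochs=200, patience=10):
    best_err, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader)        # hypothetical helper
        val_err = evaluate(model, val_loader)       # hypothetical helper
        if val_err < best_err:
            best_err = val_err
            best_state = copy.deepcopy(model.state_dict())
            bad_epochs = 0
        elif (bad_epochs := bad_epochs + 1) >= patience:
            break                                   # validation error stopped improving
    model.load_state_dict(best_state)               # keep the checkpoint with the lowest validation error
    return best_err
```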

------

The general workflow is as above. See also this excellent reference:

Practical Recommendations for Gradient-Based Training of Deep Architectures Y. Bengio(2012)

https://link.springer.com/chapter/10.1007/978-3-642-35289-8_26

It is also worth mentioning unsupervised pre-training. When data is scarce, you can also look for a similar task and do transfer learning, fine-tuning, and so on.


Finally, with so many hyperparameters in a network, how do you choose them? That paper also recommends: random hyperparameter search!


Most of the above concerns supervised learning; with unsupervised (pre-)learning, you can follow up with fine-tuning.

Next, a list of individual topics. Additions are welcome!

    Standardization (Normalization)

Many machine learning models require this, so it is not discussed at length here: a neural network assumes its inputs/outputs roughly follow a zero-mean, unit-variance distribution. The main reasons are to treat every feature fairly, to make the optimization process smoother, and to remove the effect of units/scale. Common choices: z-score, min-max, decimal scaling, etc. (a minimal scaling sketch follows the list below).

  • Controlling feature scale matters: output features with a large scale produce larger errors; input features with a large scale dominate the network, making it overly sensitive to those features and producing large updates for them.

  • Features that naturally have a very small range need special care, to avoid producing NaNs.

  • Even without standardization the network may still train, but the first few layers end up doing something similar implicitly, which silently increases the network's effective complexity.

  • Usually all input features are standardized independently by the same rule; if the task has special needs, certain features can be treated specially.
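A minimal sketch of z-score and min-max scaling with NumPy, fitting the statistics on the training set only and reusing them at test time; the random data is a placeholder.

```python
import numpy as np

def zscore_fit(X):
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-8
    return mu, sigma

X_train = np.random.randn(100, 5) * 3.0 + 1.0
mu, sigma = zscore_fit(X_train)
X_train_n = (X_train - mu) / sigma       # roughly zero mean, unit variance per feature

def minmax(X, lo, hi):
    return (X - lo) / (hi - lo + 1e-8)   # rescale each feature into [0, 1]
```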

    Checking results (Results Check)

This is a bit like a monitoring system for the model (covering preprocessing, training, and prediction). It helps you discover where the model went wrong. Ideally find a way to visualize things so problems are obvious at a glance; images, for example, are very intuitive.

  • Be sure you understand the meaning of the error you defined. Even if the training error is decreasing, you still need to compare it against the real-world error: the training error may be small yet still not small enough for the application, which means the model has not learned enough.

  • Once things work during training, check the performance on the validation set.

  • Before changing the network structure, make sure every stage has its "monitor" in place; don't blindly do wasted work.

    Preprocessing (Pre-Processing Data)

In reality, the same data can be expressed in different ways , Like a mobile car , You look at it from different angles , It does the same thing . You should make sure that the same data is observed from the South and from the West , It should be similar !

  • A neural network assumes the data occupy a continuous region of the input space.

  • Preprocessing reduces the error caused by the diversity of representations, and indirectly removes the unnecessary complexity the first few layers would otherwise spend on learning these "equivalent" mappings.

    Regularization (Regularization)

Add dropout, stochasticity, noise, data augmentation, etc. Even if you have plenty of data and think overfitting is impossible, some regularization is still better than none, e.g. dropout(0.99). A small inverted-dropout sketch follows the list below.

  • On one hand it alleviates overfitting; on the other, the injected randomness smooths and speeds up the training process and handles outliers.

  • Dropout can be seen as an ensemble and as feature sampling, equivalent to bagging many sub-networks; during training it dynamically expands the dataset with perturbed versions of the input. (In a single-layer network it sits somewhere between naive Bayes (all feature weights independent) and logistic regression (all features interacting).)

  • Generally, the larger and more complex the network, the better dropout works; it is a strong regularizer!

  • The best defense against overfitting is still a lot of non-duplicated data.
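A minimal sketch of inverted dropout with NumPy: at training time each hidden activation is kept with probability p and scaled by 1/p, so no rescaling is needed at test time. The batch of activations is a placeholder.

```python
import numpy as np

def dropout_forward(h, p=0.5, train=True):
    if not train:
        return h                              # test time: use the full network
    mask = (np.random.rand(*h.shape) < p) / p
    return h * mask                           # randomly silence units, rescale the rest

h = np.random.randn(4, 16)
h_train = dropout_forward(h, p=0.8, train=True)
```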

   Batch size too large

A batch size that is too large reduces the randomness of gradient descent and hurts the model's accuracy.
If you can tolerate a longer training time, it is best to start with a batch size as small as possible (16, 8, even 1).

  • Large batch sizes need more epochs to reach the same level.

  • Reason 1 (for small batches): the noise helps you jump out of local minima during training.

  • Reason 2: it lets training settle into flatter local minima, which improves generalization.

    Learning rate lr

Remove gradient clipping first (frameworks often have it on by default), find the largest lr for which the model's error does not explode during training, then train with a somewhat smaller lr. A clipping sketch follows the list below.

  • In general, outliers in the data produce large errors, hence large gradients and large weight updates, which makes the best lr hard to find.

  • Preprocess the data (remove the outliers), and lr usually will not need clipping at all.

  • If the error explodes, adding gradient clipping is only temporary relief; the real cause is still in the data.
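A minimal sketch of gradient-norm clipping. In PyTorch the built-in `torch.nn.utils.clip_grad_norm_` does this over a model's parameters; the NumPy version shows the idea on a single gradient vector with a placeholder threshold.

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # rescale, keep the direction
    return grad

# PyTorch equivalent (call before optimizer.step()):
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```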

  The activation function of the last layer

It limits the range of the output; often no activation is needed at all.


Think carefully about what the inputs are and what range the (standardized) outputs should cover. If the output can be positive or negative, ReLU or sigmoid is obviously wrong. Multi-class tasks generally use softmax (which normalizes the outputs into a probability distribution).

  • The activation is just a mapping; in theory any of them could work.

  • If the output produces no error signal, it obviously will not work: there is no gradient and the model learns nothing.

  • tanh is often used here, but it has a problem: the gradient is very small near -1 and 1, so saturated neurons learn slowly, gradients easily vanish, and the model tends to produce many values close to -1 or 1.

   Bad Gradient(Dead Neurons)

With the ReLU activation, the gradient is 0 for inputs below zero, which can hurt performance or even stop parts of the model from updating at all.


If you find that the training error does not change as the epochs go by, perhaps all the neurons have "died". Try a different activation such as leaky ReLU or ELU and watch how the training error changes (a small sketch follows the list below).

  • When using ReLU, add a little noise to the parameter initialization to break complete symmetry and avoid 0 gradients; you can even add noise to the biases.

  • By contrast, for sigmoid, since it is most sensitive around 0 (where its gradient is largest), initializing everything to 0 is fine.

  • Any operation on the gradients, such as clipping, rounding, or max/min, can cause similar problems.

  • Advantages of ReLU over sigmoid: one-sided suppression, a wide excitation range, sparse activation, and relief of the vanishing-gradient problem.
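A small NumPy sketch contrasting ReLU and leaky ReLU: once a ReLU unit's pre-activation is negative everywhere, its gradient is zero and the unit stops updating; the leaky variant keeps a small slope alive. The sample inputs are placeholders.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))        # negative inputs give 0 output and 0 gradient (the "dead" region)
print(leaky_relu(z))  # negative inputs still pass a small signal/gradient
```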

    Weight initialization

Generally you initialize the weights to small random numbers, but it is not always that simple: some architectures require specific initialization schemes, and with a bad initialization you may never reproduce a paper's results! Try the popular schemes to find one that works (a short sketch follows the list below).

  • Too small: the signal shrinks as it propagates and the network struggles to learn.

  • Too large: the signal is amplified as it propagates and training diverges.

  • Popular choices include 'he', 'lecun', and 'Xavier' (zero-mean weights with variance 2 / (number of input nodes + number of output nodes)).

  • Biases are usually initialized to 0.

  • Initialization matters at every layer.
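A minimal sketch of these initializers, assuming a small PyTorch model with placeholder layer sizes: Xavier keeps the activation variance roughly constant for tanh/sigmoid layers, He (Kaiming) does the same for ReLU layers, and biases start at zero.

```python
import torch.nn as nn

layer_tanh = nn.Linear(256, 256)
nn.init.xavier_uniform_(layer_tanh.weight)   # variance ~ 2 / (fan_in + fan_out)
nn.init.zeros_(layer_tanh.bias)

layer_relu = nn.Linear(256, 256)
nn.init.kaiming_normal_(layer_relu.weight, nonlinearity="relu")
nn.init.zeros_(layer_relu.bias)
```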

   The network is too deep

Deeper networks are said to give higher accuracy, but depth should not be stacked blindly; increase it only on top of a shallow network that already works. Depth is added to improve accuracy; if a shallow network cannot learn anything, going deeper will not help.


Start with 3-8 layers; once that works well, deepen the network to chase higher accuracy.

  • The optimization tricks above should already help in the shallow network; if results are still poor, the problem is definitely not lack of depth.

  • Training and prediction slow down as the network gets deeper.

   Number of hidden neurons


It is best to start from the structures other researchers use on similar tasks, typically 256-1024 units.


Too many: training is slow and noise is hard to filter out (overfitting).

Too few: fitting capacity drops.

  • Consider how much information the task actually needs to pass through, then add some headroom (to account for dropout, redundant representations, and estimation slack).

  • Classification tasks: initially try 5-10 times the number of classes.

  • Regression tasks: initially try 2-3 times the number of input/output features.

  • Intuition is important here

  • In the end the effect is actually small, mostly costing training time, so just try a few values.

   loss function

Multi-class tasks generally use cross-entropy, not MSE (a small sketch follows the list below).

  • Multi-class problems generally use softmax; over part of its range the gradient is very small, and adding a log (i.e. using the log-likelihood) improves this problem.

  • Avoid MSE here because it causes learning to slow down: the effective learning rate becomes controlled by the output error (derive it yourself to see).
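A minimal NumPy sketch of softmax plus cross-entropy for a 3-class problem; the logits and labels are placeholders. Compared with MSE, the log keeps the gradient large when the predicted probability of the true class is small, so learning does not stall.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, target_idx):
    p = softmax(logits)
    return -np.log(p[np.arange(len(target_idx)), target_idx] + 1e-12).mean()

logits = np.array([[2.0, 0.5, -1.0]])
print(cross_entropy(logits, np.array([0])))
```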

   Dimensionality reduction with autoencoders (AE)

Apply L1 regularization to the intermediate hidden layer; the penalty coefficient controls how sparse the hidden units become (a small sketch follows).
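A minimal PyTorch sketch of an L1 sparsity penalty on an autoencoder's hidden layer; the layer sizes, the batch, and the coefficient `lam` are placeholders.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)

x = torch.randn(32, 784)                     # stand-in batch
h = encoder(x)
recon = decoder(h)
lam = 1e-4                                   # larger lam -> sparser hidden codes
loss = nn.functional.mse_loss(recon, x) + lam * h.abs().mean()
loss.backward()
```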

   SGD

SGD is an unstable algorithm: different learning rates can give very different results, so it needs careful tuning.

Generally you want it large at the start to speed up convergence, and small later so it settles stably into a (local) optimum.


You can also use adaptive methods such as Adam, Adagrad, or Adadelta to reduce the tuning burden (the defaults generally work).

  • For SGD you need to understand the learning rate, momentum, Nesterov momentum, and so on.

  • Worth mentioning: many of a neural network's local optima can give good results, while the exact global optimum tends to overfit.

   Using CNNs

A neural network is a feature-learning method whose power comes from the hidden layers; more connections mean an explosion of parameters, and the resulting model complexity directly causes problems such as severe overfitting and high computational cost.


The performance of CNNs makes them well worth using: the number of parameters depends only on the size and number of the convolution kernels, so while keeping the number of hidden units (which depends on the convolution stride), the parameter count drops dramatically! CNNs are of course mostly used on images; for other tasks, design your own abstraction and experiment.


Here is a brief list of CNN tricks (a small sketch combining several of them follows the list):

  • Use different pooling / convolution sizes and strides to increase data diversity.

  • Data augmentation: avoid overfitting, improve generalization, add noise perturbations.

  • weight regularization

  • Train SGD with a learning-rate decay schedule.

  • Use (average) pooling instead of a fully connected layer at the end to reduce the parameter count.

  • Max pooling instead of average pooling, avoiding the blurring effect of average pooling.

  • Two 3x3 convolutions replace one 5x5, etc.: fewer parameters, extra nonlinear mappings, and stronger feature-learning capacity for the CNN.

  • 3x3,2x2 window

  • Pre training methods, etc

  • Feed the model after data preprocessing (PCA, ZCA).

  • Ensemble the outputs over multiple windows (crops).

  • Intermediate nodes as auxiliary output heads: this acts like model fusion, adds extra back-propagated gradient signal, and provides additional regularization.

  • 1x1 convolutions: organize information across channels, improve the network's expressiveness, optionally reduce the output dimensionality at low cost (great cost/benefit ratio), add a nonlinear mapping, and fit the Hebbian principle.

  • NIN increases the network's adaptability to different scales, similar to the Multi-Scale idea.

  • Factorization into small convolutions: replace 7x7 with 1x7 followed by 7x1 to save parameters and add nonlinearity.

  • BatchNorm reduces the internal covariate shift problem, speeds up learning, and reduces overfitting; you may be able to drop dropout, raise the learning rate, relax other regularization, and reduce optical distortions in data augmentation.

  • When the model hits the degradation problem, consider shortcut (residual) structures and then increase the depth.

  • And so on.
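A small PyTorch sketch combining several of the tricks listed above: two 3x3 convolutions instead of one 5x5, BatchNorm, a 1x1 convolution to mix channels cheaply, and global average pooling instead of a fully connected head. The channel counts and input size are placeholders.

```python
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),  # two stacked 3x3 ~ one 5x5 receptive field
            nn.Conv2d(32, 64, kernel_size=1), nn.ReLU(),   # 1x1 conv mixes channels at low cost
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # global average pooling replaces the FC layers
            nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.head(self.features(x))

y = SmallConvNet()(torch.randn(2, 3, 32, 32))              # -> shape (2, 10)
```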

   Using RNNs

The small details are very similar to the above; here are just a few additional personal observations. In fact, an RNN (with gating) is itself a kind of shortcut structure (a minimal sketch follows the list below).

  • LSTM structures are commonly used to prevent vanishing gradients in BPTT; GRUs have fewer parameters and can be tried first.

  • Mind the preprocessing details: padding, sequence-length handling, rare-word handling, etc.

  • A language model generally needs a very large amount of data.

  • Gradient Clipping

  • For Seq2Seq structures, consider attention, provided there is a lot of data.

  • CNN + gating structures also perform very well on sequence models.

  • For generative models, look at GAN and VAE, which generate from random variables.

  • Combination with RL frameworks.

  • With little data, a simple MLP may suffice.

  • Use a hierarchical structure for prediction to reduce training complexity.

  • Design the sampling strategy to speed up model convergence.

  • Add shortcut structures at multiple levels.
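A minimal PyTorch sketch of a few of the points above: a GRU classifier over padded sequences with gradient clipping before the update. The vocabulary size, sequence length, batch, and labels are illustrative placeholders.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, vocab=10000, emb=128, hidden=256, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb, padding_idx=0)   # index 0 reserved for padding
        self.rnn = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, tokens):
        _, h = self.rnn(self.emb(tokens))                    # h: (1, batch, hidden)
        return self.out(h.squeeze(0))

model = GRUClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
tokens = torch.randint(1, 10000, (8, 20))                    # stand-in padded batch
labels = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(tokens), labels)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)      # clip before the update, as recommended above
opt.step()
```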

That's it for today; additions are welcome.
