Summary of regularization methods
Preface
This article is adapted from the notes "[Deep Learning] Regularization" by Poll, published on Blog Garden (cnblogs).
Before summarizing regularization (Regularization) methods, let's first discuss what regularization is and why we regularize.
Regularization may sound abstract and broad, but its essence is simple: it is a means of imposing prior restrictions or constraints on a problem in order to achieve a specific goal. In machine learning, the purpose of regularization is to prevent the model from overfitting. When regularization is mentioned, many people immediately think of the commonly used L1 and L2 norms, so before summarizing the methods, let's first look at what the LP norm actually is.
Catalog
LP norm
L1 norm
L2 norm
The difference between the L1 norm and the L2 norm
Dropout
Batch Normalization
Normalization, Standardization & Regularization
LP norm
A norm can be understood simply as a way of measuring distance in a vector space. The definition of a distance itself is quite abstract: any function that is non-negative, reflexive, and satisfies the triangle inequality can be called a distance.
The LP norm is not a single norm but a family of norms, defined as follows:
$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$
Here p ranges over [1, +∞). For p in (0, 1) the formula above does not define a norm, because it violates the triangle inequality.
As p changes, the norm changes as well. A classic illustration of how the unit ball of the LP norm varies with p is shown below:
[Figure: unit balls of the LP norm as p varies from 0 to +∞]
The figure above shows how the unit ball changes as p goes from 0 to positive infinity. For p ≥ 1 the unit ball defined by the LP norm is a convex set, but for 0 < p < 1 the "unit ball" is not convex (as noted above, for 0 < p < 1 the formula does not define a norm).
That raises a question: what is the L0 norm?
The L0 norm counts the number of non-zero elements in a vector, with the following formula:
$\|x\|_0 = \#\{\, i \mid x_i \neq 0 \,\}$
By minimizing the L0 norm we can search for the sparsest set of features. Unfortunately, minimizing the L0 norm is an NP-hard problem (the L0 norm is also non-convex). In practice we therefore turn to a convex relaxation of the L0 norm. It can be shown theoretically that the L1 norm is the tightest convex approximation of the L0 norm, so the L1 norm is usually optimized instead of the L0 norm directly.
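To make these definitions concrete, here is a minimal NumPy sketch (the vector `x` is just an arbitrary example) that computes the L0, L1, L2 and a general LP norm of a vector:

```python
import numpy as np

x = np.array([0.0, -3.0, 0.0, 4.0, 0.5])   # arbitrary example vector

# L0 "norm": number of non-zero entries (not a true norm, and non-convex)
l0 = np.count_nonzero(x)

# L1 norm: sum of absolute values (the tightest convex surrogate of L0)
l1 = np.sum(np.abs(x))

# L2 norm: Euclidean length
l2 = np.sqrt(np.sum(x ** 2))

# General LP norm for p >= 1
def lp_norm(v, p):
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

print(l0, l1, l2, lp_norm(x, 3))   # -> 3, 7.5, ~5.02, 4.5
```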
L1 norm
From the definition of the LP norm we can immediately write down the mathematical form of the L1 norm:
$\|x\|_1 = \sum_{i=1}^{n} |x_i|$
As the formula shows, the L1 norm is the sum of the absolute values of the vector's elements. It is also known as the "sparsity-inducing regularizer" (Lasso regularization). So why do we want sparsity? Sparsity has many advantages; the two most direct ones are:
Feature selection
Interpretability
L2 norm
The L2 norm is the most familiar one: it is simply the Euclidean distance, with the formula:
$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$
The L2 norm goes by many names: in regression it is called "ridge regression" (Ridge Regression), and it is also known as "weight decay" (Weight Decay). Using the L2 norm as the regularization term yields a dense solution, i.e. the parameter w for every feature is very small, close to 0 but not equal to 0. In addition, the L2 regularization term prevents the model from becoming so complex that it overfits the training set, thereby improving the model's generalization ability.
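As a concrete illustration of why L2 regularization is also called weight decay, here is a minimal sketch, not from the original article, of a ridge-style loss and one gradient step on it; the data, the penalty coefficient `lam`, and the learning rate are illustrative:

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """Mean squared error plus an L2 penalty on the weights."""
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * np.sum(w ** 2)

def ridge_grad_step(w, X, y, lam, lr=0.1):
    """One gradient-descent step on the ridge loss."""
    grad_data = 2.0 * X.T @ (X @ w - y) / len(y)
    grad_penalty = 2.0 * lam * w                  # this term shrinks every weight
    return w - lr * (grad_data + grad_penalty)    # "weight decay" in action

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = np.zeros(5)
for _ in range(200):
    w = ridge_grad_step(w, X, y, lam=0.5)
print(w)   # all components end up small but non-zero: a dense solution
```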
The difference between the L1 norm and the L2 norm
A classic diagram from PRML illustrates the difference between the L1 and L2 norms, as shown below:
[Figure (from PRML): the circular L2 and diamond-shaped L1 constraint regions meeting the contours of the original objective]
As shown in the figure above, the blue region represents the feasible solutions of the original problem and the orange region represents the feasible solutions of the regularization term. The overall objective (original problem + regularization) reaches its optimum where the two regions touch, i.e. where they are tangent. It is easy to see from the figure that, because the L2 constraint region is a circle, the tangent point is very unlikely to lie on a coordinate axis, whereas the L1 constraint region is a diamond whose vertices lie on the axes, so the tangent point is far more likely to fall on an axis. A point on an axis has the property that only one coordinate component is non-zero while all the others are zero, i.e. it is sparse. Hence the conclusion: the L1 norm tends to yield sparse solutions, while the L2 norm tends to yield dense solutions.
From the perspective of Bayesian priors, relying on the training data alone is not enough when fitting a model; to achieve better generalization we often need to inject prior knowledge, and adding a regularization term is equivalent to adding a prior:
The L1 norm is equivalent to adding a Laplace prior;
The L2 norm is equivalent to adding a Gaussian prior.
As shown in the figure below:
[Figure: Laplace prior (corresponding to L1) vs. Gaussian prior (corresponding to L2)]
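One way to see the sparse-vs-dense behaviour numerically is to compare the kind of update each penalty induces on the weights: an L1 penalty subtracts a fixed amount and can set weights exactly to 0 (soft-thresholding), while an L2 penalty shrinks weights proportionally, so they approach 0 without reaching it. Below is a minimal sketch of this contrast; the penalty strength `lam` and the weights are illustrative, not taken from the article:

```python
import numpy as np

def l1_soft_threshold(w, lam):
    """Proximal operator of the L1 penalty: entries with |w| <= lam
    become exactly 0, which is where sparsity comes from."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def l2_shrink(w, lam):
    """The kind of proportional shrinkage an L2 penalty induces:
    weights get smaller but stay non-zero, giving a dense solution."""
    return w / (1.0 + lam)

w = np.array([0.05, -0.30, 0.80, -0.02])
print(l1_soft_threshold(w, lam=0.1))   # two entries become exactly 0 -> sparse
print(l2_shrink(w, lam=0.1))           # every entry merely scaled toward 0 -> dense
```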
Dropout
Dropout is a regularization method frequently used in deep learning. Its operation can be understood simply as dropping neurons with probability p while training a DNN, i.e. setting the output of the dropped neurons to 0. Dropout can be illustrated by the following figure:
[Figure: a standard fully connected network vs. the same network after applying Dropout]
We can understand the regularization effect of Dropout intuitively from two aspects:
In each training round, randomly dropping neurons is equivalent to averaging over many different DNNs, so Dropout has the effect of an ensemble vote.
It reduces complex co-adaptation between neurons. Randomly deleting hidden-layer neurons makes the fully connected network somewhat sparse, which effectively reduces the joint effect of different features. In other words, some features might come to depend on fixed interactions between particular hidden nodes; Dropout effectively prevents situations where a feature is only effective in the presence of certain other features, which increases the robustness of the neural network.
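Below is a minimal NumPy sketch of the (inverted) dropout operation described above; the drop probability `p` and the toy activations are illustrative, and deep learning frameworks provide this as a built-in layer:

```python
import numpy as np

def dropout_forward(a, p, training=True):
    """Inverted dropout: zero each activation with probability p and
    rescale the survivors by 1/(1-p) so the expected output is unchanged.
    At test time the layer simply passes activations through."""
    if not training or p == 0.0:
        return a
    mask = (np.random.rand(*a.shape) >= p) / (1.0 - p)
    return a * mask

activations = np.random.randn(4, 8)         # a small batch of hidden activations
out = dropout_forward(activations, p=0.5)   # roughly half the units are zeroed
```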
Batch Normalization
Batch Normalization is, strictly speaking, a normalization technique. It is mainly used to accelerate the convergence of the network, but it also has a certain regularization effect.
The explanation of covariate shift below is borrowed from Dr. Wei Xiushen's answer on Zhihu.
Note: the following is quoted from Dr. Wei Xiushen's answer:
A classic assumption in statistical machine learning is that the data distributions of the source domain and the target domain are consistent. If they are not, we face new machine learning problems such as transfer learning / domain adaptation. Covariate shift is a sub-problem under the assumption of inconsistent distributions: the conditional probabilities of the source and target domains are the same, but their marginal probabilities differ. On reflection, you will find that this is exactly what happens inside a neural network: because each layer applies its own operations, the distribution of each layer's output clearly differs from the distribution of its input, and the difference grows as the network gets deeper; yet the sample labels that these activations "describe" remain unchanged. This matches the definition of covariate shift.
The basic idea of BN is actually quite intuitive. The pre-activation input of a neural network (X = WU + B, where U is the layer input) gradually shifts or changes in distribution as the network deepens (this is the covariate shift). Training converges slowly because the overall distribution gradually drifts toward the saturated ends of the nonlinear activation function's input range (for the Sigmoid function, this means the pre-activation values X = WU + B become large negative or large positive values), which causes the gradients of the lower layers to vanish during back-propagation. This is the essential reason why training deep neural networks converges more and more slowly. BN uses a normalization step to force the distribution of every neuron's input in each layer back to a standard normal distribution with mean 0 and variance 1, thus avoiding the gradient-vanishing problem caused by the activation function. So rather than saying that BN alleviates covariate shift, it would be more accurate to say that BN alleviates the gradient-vanishing problem.
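Here is a minimal NumPy sketch of the training-time BN transform described above: each feature is normalized over the mini-batch to zero mean and unit variance and then rescaled by the learnable parameters gamma and beta. The variable names are illustrative; at test time, running averages of the batch statistics are used instead:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch_size, num_features) pre-activations (X = WU + B).
    Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # mean 0, variance 1
    return gamma * x_hat + beta             # learnable scale and shift

x = np.random.randn(32, 10) * 5.0 + 3.0     # a badly scaled batch of pre-activations
out = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
# each feature of `out` now has (approximately) mean 0 and variance 1
```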
Normalization, Standardization & Regularization
Regularization has already been covered above; here is a brief introduction to normalization and standardization.
Normalization: the goal of normalization is to find a mapping that transforms the raw data onto an interval [a, b]. Typically [a, b] is taken to be [-1, 1] or [0, 1].
There are generally two application scenarios:
Converting a number into a decimal between (0, 1)
Converting a dimensional quantity into a dimensionless one
The most commonly used form is min-max normalization:
$x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$
Standardization: drawing on the law of large numbers, the data is transformed so that it (approximately) follows a standard normal distribution. The standardization formula is:
$x' = \dfrac{x - \mu}{\sigma}$
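A minimal NumPy sketch of both transforms on a toy feature column (the values are arbitrary examples):

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])    # a toy feature column

# Min-max normalization: map onto [0, 1]; only the extremes matter
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, variance 1; every point contributes
x_std = (x - x.mean()) / x.std()

print(x_minmax)                    # values lie in [0, 1]
print(x_std.mean(), x_std.std())   # ~0.0 and 1.0
```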
The difference between normalization and standardization:
It can be explained simply as follows:
Normalization "flattens" the data uniformly onto an interval, and the scaling is determined only by the extreme values; standardization is more "elastic" and "dynamic", since it depends strongly on the distribution of the whole sample.
It is worth noting that:
Normalization: the scaling depends only on the maximum and the minimum values.
Standardization: the scaling depends on every data point, as reflected by the variance. Compared with normalization, every data point contributes to standardization (through the mean and the standard deviation).
Why standardize or normalize?
Improved model accuracy: after normalization, features measured on different scales become numerically comparable, which can greatly improve the accuracy of the classifier.
Faster model convergence: after standardization, the optimization landscape becomes noticeably smoother, making it easier to converge correctly to the optimal solution, as shown in the figure below:
[Figure: loss contours and gradient-descent paths before and after feature scaling]
Reference articles
Andrew Ng's Deep Learning course
Must Know Tips/Tricks in Deep Neural Networks (by Xiu-Shen Wei)