Summary of regularization methods

2022-06-25 15:13:00 【m0_ sixty-one million eight hundred and ninety-nine thousand on】

Preface

This article is reposted from "[Deep Learning] Regularization" on Poll's Notes, Cnblogs (Blog Garden).

Before summarizing regularization methods, let's first discuss what regularization is and why we regularize.

Regularization may sound abstract and broad, but its nature is simple: it is a means or operation that imposes prior constraints on a problem in order to achieve a specific purpose. In learning algorithms, regularization is used to prevent the model from overfitting. When regularization is mentioned, many readers immediately think of the commonly used L1 and L2 norms. Before summarizing them, let's first look at what the Lp norm is.

Contents

The difference between the L1 and L2 norms

Dropout

Batch Normalization

Normalization, Standardization & Regularization

Reference article

Lp norm

A norm can be understood simply as a measure of distance in a vector space. The definition of distance is quite abstract: anything that satisfies non-negativity, reflexivity, and the triangle inequality can be called a distance.

The Lp norm is not a single norm but a family of norms, defined as follows:

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$

Here p ranges over [1, +∞]. For p in (0, 1) the expression above does not define a norm, because it violates the triangle inequality.
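The definition above can be checked numerically. The following is a minimal sketch in Python with NumPy; the helper name `lp_norm` and the two test vectors are illustrative choices, not part of the original article. It evaluates the expression for several values of p and shows that the triangle inequality fails for p = 0.5.

```python
import numpy as np

def lp_norm(x, p):
    """Return (sum_i |x_i|^p)^(1/p); p = np.inf gives max_i |x_i|."""
    x = np.abs(np.asarray(x, dtype=float))
    if np.isinf(p):
        return float(x.max())
    return float(np.sum(x ** p) ** (1.0 / p))

# Two unit vectors along the coordinate axes (illustrative choice).
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

for p in (0.5, 1, 2, np.inf):
    lhs = lp_norm(x + y, p)               # "length" of the sum
    rhs = lp_norm(x, p) + lp_norm(y, p)   # sum of the "lengths"
    print(f"p={p}: ||x+y||={lhs:.3f}, ||x||+||y||={rhs:.3f}, "
          f"triangle inequality holds: {lhs <= rhs + 1e-12}")
```

For p = 0.5 the left-hand side is 4 while the right-hand side is 2, so the expression is not a norm there; for p = 1, 2 and ∞ the inequality holds on this example.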

As p changes, the norm changes as well. A classic figure illustrating how the Lp norm varies with p is shown below:

The figure above shows how the unit ball changes as p goes from 0 to positive infinity. The unit ball defined by an Lp norm is a convex set, but for 0 < p < 1 the unit ball under this definition is not convex (as noted above, for 0 < p < 1 the expression is not a norm).

So the question is: what is the L0 norm?

The L0 norm is the number of non-zero elements in a vector:

$$\|x\|_0 = \#\{\, i : x_i \neq 0 \,\}$$

We can minimize the L0 norm to find the smallest set of features, i.e. the sparsest solution. Unfortunately, optimizing the L0 norm is an NP-hard problem (the L0 norm is also non-convex). In practice, therefore, we usually work with a convex relaxation of L0. It has been proven theoretically that the L1 norm is the optimal convex approximation of the L0 norm, so the L1 norm is normally used instead of optimizing the L0 norm directly.
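To give a feel for why direct L0 minimization is expensive, here is a small brute-force sketch: it searches for the sparsest coefficient vector that reproduces y exactly by trying every candidate support in order of increasing size. The function name `sparsest_exact_fit`, the synthetic data, and the tolerance are assumptions made for illustration; the number of supports grows as 2^n, so this only works for tiny problems.

```python
import itertools
import numpy as np

def sparsest_exact_fit(X, y, tol=1e-8):
    """Return the smallest index set S (and coefficients) with X[:, S] @ w_S = y."""
    n_features = X.shape[1]
    if np.linalg.norm(y) < tol:
        return (), np.zeros(0)
    for k in range(1, n_features + 1):            # supports of growing size
        for support in itertools.combinations(range(n_features), k):
            cols = X[:, list(support)]
            coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
            if np.linalg.norm(cols @ coef - y) < tol:
                return support, coef
    return None, None

# Synthetic example (hypothetical data): only features 1 and 4 are active.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 6))
w_true = np.zeros(6)
w_true[[1, 4]] = [2.0, -3.0]
y = X @ w_true

support, coef = sparsest_exact_fit(X, y)
print("recovered support:", support)
print("L0 norm of w_true:", np.count_nonzero(w_true))
print("L1 norm of w_true:", np.abs(w_true).sum())
```

The L1 norm printed at the end is the convex surrogate that is minimized in practice instead of running this combinatorial search.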

L1 norm

From the definition of the Lp norm, we easily obtain the mathematical form of the L1 norm:

$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$

As the formula shows, the L1 norm is the sum of the absolute values of a vector's elements. It is also known as the "sparse rule operator" (Lasso regularization). This raises the question: why do we want sparsity? Sparsity has many advantages; the two most direct are:

Feature selection
Interpretability

L2 norm

The L2 norm is the most familiar one: it is the Euclidean distance, given by

$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$

The L2 norm goes by many names. Regression that uses it is called "ridge regression", and it is also known as "weight decay". Using the L2 norm as the regularization term yields a dense solution, i.e. the parameter w corresponding to each feature is small, close to 0 but not exactly 0. In addition, using the L2 norm as the regularization term prevents the model from becoming overly complex just to fit the training set, which improves its generalization ability.
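The name "weight decay" can be made concrete with a short sketch. Assuming plain gradient descent and a penalty written as (lam / 2) * ||w||_2^2 (both are assumptions for this illustration, as are the function names and the sample numbers), adding the penalty's gradient is the same as shrinking the weights by a constant factor before each step:

```python
import numpy as np

def step_with_l2_penalty(w, grad_loss, lr, lam):
    """One gradient step on loss(w) + (lam / 2) * ||w||_2^2."""
    return w - lr * (grad_loss + lam * w)

def step_with_weight_decay(w, grad_loss, lr, lam):
    """Same step rewritten: decay the weights, then follow the loss gradient."""
    return (1.0 - lr * lam) * w - lr * grad_loss

w = np.array([0.5, -2.0, 3.0])
g = np.array([0.1, 0.0, -0.2])   # hypothetical gradient of the data loss

a = step_with_l2_penalty(w, g, lr=0.1, lam=0.01)
b = step_with_weight_decay(w, g, lr=0.1, lam=0.01)
print("identical updates:", np.allclose(a, b))
```

Every weight is multiplied by (1 - lr * lam) at each step, which pulls all weights toward 0 without setting any of them exactly to 0, consistent with the dense solutions described above.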

The difference between the L1 and L2 norms

A classic figure from PRML illustrates the difference between the L1 and L2 norms, as shown below:

In the figure above, the blue contours indicate the possible solution region of the original problem, and the orange region indicates the possible solution region of the regularization term. The solution of the overall objective (original problem + regularization term) lies where the two regions are tangent. It is easy to see from the figure that, because the solution region of the L2 norm is a circle, the point of tangency is very likely not on a coordinate axis. The solution region of the L1 norm is a diamond with protruding vertices, so the point of tangency is more likely to lie on a coordinate axis. A point on an axis has exactly one non-zero coordinate while all the others are zero, i.e. it is sparse. Hence the conclusion: the L1 norm leads to sparse solutions, and the L2 norm leads to dense solutions.
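The geometric argument can also be seen in a one-dimensional closed form. The sketch below assumes a simple separable quadratic loss 0.5 * (w - a)^2 per weight, where a is the unregularized optimum; this setup, the function names, and the numbers are illustrative assumptions. Under that assumption the L1 penalty gives soft-thresholding, while the L2 penalty gives uniform shrinkage.

```python
import numpy as np

def l1_solution(a, lam):
    """argmin_w 0.5 * (w - a)**2 + lam * |w|   (soft-thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def l2_solution(a, lam):
    """argmin_w 0.5 * (w - a)**2 + 0.5 * lam * w**2   (uniform shrinkage)."""
    return a / (1.0 + lam)

a = np.array([3.0, -0.4, 0.2, -1.5])   # hypothetical unregularized optimum
lam = 0.5

w_l1 = l1_solution(a, lam)
w_l2 = l2_solution(a, lam)
print("L1 solution:", w_l1)
print("L2 solution:", w_l2)
print("exact zeros under L1:", int(np.sum(w_l1 == 0)),
      "| exact zeros under L2:", int(np.sum(w_l2 == 0)))
```

Entries of a whose magnitude is below lam are set exactly to zero by the L1 penalty, whereas the L2 penalty only scales every entry down by the same factor.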

From the perspective of Bayesian priors: when training a model, relying on the current training data alone is not enough. To achieve better generalization, it is often necessary to add a prior term, and adding a regularization term is equivalent to adding a prior.

The L1 norm is equivalent to adding a Laplace prior;
The L2 norm is equivalent to adding a Gaussian prior.
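As a sketch of why this correspondence holds, consider the standard maximum a posteriori (MAP) argument, which the original does not spell out. The data set $D$, the scale parameters $b$ and $\sigma$, and the assumption of zero-mean, independent priors on the weights are introduced here for illustration only.

$$
\hat{w}_{\mathrm{MAP}} = \arg\max_w \; p(D \mid w)\, p(w) = \arg\min_w \; \bigl[ -\log p(D \mid w) - \log p(w) \bigr]
$$

$$
p(w) \propto \exp\!\left(-\frac{\|w\|_1}{b}\right) \;\Rightarrow\; -\log p(w) = \frac{1}{b}\,\|w\|_1 + \text{const}
$$

$$
p(w) \propto \exp\!\left(-\frac{\|w\|_2^2}{2\sigma^2}\right) \;\Rightarrow\; -\log p(w) = \frac{1}{2\sigma^2}\,\|w\|_2^2 + \text{const}
$$

The negative log-prior therefore plays the role of the regularization term: a Laplace prior contributes an L1 penalty with strength $1/b$, and a Gaussian prior contributes a squared L2 penalty with strength $1/(2\sigma^2)$.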

As shown in the figure below ：

Dropout

Dropout is a regularization method commonly used in deep learning. It can be understood simply as discarding some neurons with probability p while training a DNN, i.e. setting the output of each discarded neuron to 0. Dropout can be illustrated by the following figure:

The regularization effect of Dropout can be understood intuitively from two aspects:

Randomly dropping neurons in each training round is equivalent to averaging over many DNNs, so Dropout has a voting (ensemble) effect.
It reduces complex co-adaptation between neurons. When hidden-layer neurons are randomly removed, the fully connected network becomes sparse to some degree, which effectively reduces the synergy between different features. In other words, some features may depend on the joint action of hidden nodes in a fixed relationship; Dropout effectively prevents the situation where some features are useful only in the presence of other features, which increases the robustness of the network.
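The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original author's code: the function name `dropout_forward` is hypothetical, and the rescaling by 1 / (1 - p) (often called "inverted dropout") is an added assumption that keeps the expected activation unchanged so that nothing needs to be rescaled at test time.

```python
import numpy as np

def dropout_forward(h, p, rng, training=True):
    """Zero each activation with probability p during training."""
    if not training or p == 0.0:
        return h                                  # no dropout at test time
    keep_mask = rng.random(h.shape) >= p          # True = neuron is kept
    # Assumption: inverted-dropout rescaling keeps the expected value of h.
    return h * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones((4, 1000))                            # hypothetical hidden activations

out = dropout_forward(h, p=0.3, rng=rng)
print("fraction of outputs set to 0:", np.mean(out == 0))
print("mean activation after dropout:", out.mean())
```

Each call draws a fresh mask, so every training step effectively trains a different thinned network, which is the source of the averaging effect mentioned in the first point.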

Batch Normalization

Batch Normalization is, strictly speaking, a normalization technique. It is mainly used to accelerate network convergence, but it also has a certain regularization effect.

For an explanation of covariate shift, we refer to Dr. Wei Xiushen's answer on Zhihu.

Note: the following is quoted from Dr. Wei Xiushen's answer:

As is well known, a classical assumption in statistical machine learning is that the data distributions of the source domain and the target domain are consistent. If they are not, new machine learning problems arise, such as transfer learning and domain adaptation. Covariate shift is a sub-problem under the assumption of inconsistent distributions: the conditional probabilities of the source and target domains are the same, but their marginal probabilities differ. On reflection, this is indeed the case for the outputs of each layer of a neural network: because they have passed through the layer's operations, their distribution clearly differs from the distribution of that layer's input signal, and the difference grows with the depth of the network, yet the sample labels they "indicate" remain unchanged. This matches the definition of covariate shift.
The basic idea of BN is in fact quite intuitive. As the network deepens, the distribution of the pre-activation input, i.e. the value before the nonlinear transformation (X = WU + B, where U is the input), gradually shifts or changes (this is covariate shift). Training generally converges slowly because the overall distribution drifts toward the upper and lower ends of the nonlinear function's value range (for the Sigmoid function, this means that the pre-activation values X = WU + B are large negative or large positive values). This causes the gradients of the lower layers to vanish during backpropagation, which is the essential reason why deep neural networks converge more and more slowly. BN uses a standardization step to force the distribution of the input of every neuron in each layer back to the standard normal distribution with mean 0 and variance 1, which avoids the vanishing-gradient problem caused by the activation function. So rather than saying that BN alleviates covariate shift, it is more accurate to say that BN alleviates the vanishing-gradient problem.
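The standardization step described in the quote can be sketched as follows for the training-time forward pass. This is a simplified illustration under stated assumptions: the function name is hypothetical, the learnable scale `gamma` and shift `beta` and the small constant `eps` come from the usual formulation of batch normalization rather than from the text above, and running statistics for inference are omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                        # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # mean 0, variance close to 1
    return gamma * x_hat + beta                # assumption: learnable affine

rng = np.random.default_rng(0)
# Hypothetical pre-activations X = WU + B that have drifted away from 0.
x = rng.normal(loc=5.0, scale=3.0, size=(256, 4))

out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print("per-feature mean:", out.mean(axis=0))
print("per-feature variance:", out.var(axis=0))
```

With `gamma = 1` and `beta = 0` the output has mean approximately 0 and variance approximately 1 per feature, which keeps Sigmoid-like activations away from their saturated regions.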

Normalization, Standardization & Regularization

We have already covered regularization; here is a brief introduction to normalization and standardization.

Normalization: the goal of normalization is to find a mapping that maps the raw data onto an interval [a, b]. Typically [a, b] is taken to be [-1, 1] or [0, 1].

There are generally two application scenarios:

Convert numbers into decimals in (0, 1)
Convert dimensional quantities into dimensionless ones

The commonly used min-max normalization is:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

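Standardization: shift and scale the data so that it has mean 0 and variance 1. The result follows the standard normal distribution only if the original data is itself normally distributed. The standardization formula is:

$$x' = \frac{x - \mu}{\sigma}$$

where μ is the sample mean and σ is the sample standard deviation.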

The difference between normalization and standardization:

It can be explained simply as follows:

Normalization scaling "flattens" the data uniformly into the target interval (determined only by the extreme values), whereas standardization scaling is more "elastic" and "dynamic" and depends heavily on the distribution of the whole sample.

It is worth noting that:

Normalization: the scaling depends only on the difference between the maximum and the minimum.

Standardization: the scaling depends on every point, as reflected through the variance. Unlike normalization, every data point contributes to standardization (through the mean and the standard deviation).
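The contrast between the two scalings can be illustrated with a small NumPy sketch. The function names and the sample data, which deliberately contain one outlier, are illustrative assumptions.

```python
import numpy as np

def min_max_normalize(x):
    """Map x onto [0, 1] using only its minimum and maximum."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Shift and scale x to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # hypothetical data with an outlier

x_mm = min_max_normalize(x)
x_std = standardize(x)
print("min-max:     ", np.round(x_mm, 3))
print("standardized:", np.round(x_std, 3))
print("mean, std after standardization:", x_std.mean(), x_std.std())
```

In the min-max result the single outlier fixes the scale, so the other four values are squeezed close to 0. In the standardized result the scale comes from the mean and standard deviation, which every data point contributes to.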

Why standardize and normalize?

Improve model accuracy: after normalization, features of different dimensions become numerically comparable, which can improve the accuracy of a classifier.
Accelerate model convergence: after standardization, the optimization process becomes noticeably smoother, and it is easier to converge correctly to the optimal solution. As shown in the figure below: