
NFNet: Extending NF-ResNet to BN-Free Training at a 4096 Batch Size | 2021 Paper

2022-06-23 15:34:00 VincentLee

The paper argues that Batch Normalization is not an essential part of a network and in fact brings many problems, so it investigates Normalizer-Free networks, aiming for both strong performance and support for large-scale training. The paper proposes the AGC gradient clipping method to assist training, which effectively prevents gradient explosion, and, building on the ideas of NF-ResNet, transforms SE-ResNet into the NFNet series, which can be trained with a huge batch size of 4096 and outperforms the EfficientNet series.

Source: Xiaofei's Algorithm Engineering Notes official account

Paper: High-Performance Large-Scale Image Recognition Without Normalization

Introduction

  Currently, most computer vision models benefit from deep residual networks and batch normalization. These two innovations make it possible to train deeper networks that achieve high accuracy on both the training and test sets. Batch normalization in particular smooths the loss landscape, enables training with larger learning rates and batch sizes, and has a regularizing effect. However, batch normalization is not perfect; it has three shortcomings in practice:

  • It is computationally expensive and consumes a lot of memory.
  • It behaves differently during training and inference, and introduces extra hyperparameters.
  • It breaks the independence between training examples within a minibatch.

  The third problem is the most serious and causes a series of downsides. First, batch normalization makes it hard to reproduce a model's accuracy across different hardware, and it often causes problems in distributed training. Second, it cannot be used in tasks that require each training example to be processed independently, such as GAN and NLP tasks. Finally, batch normalization is very sensitive to the batch size and performs poorly when the batch size is small, which limits the feasible model size on memory-constrained devices.

  Therefore, even though batch normalization is very effective, some researchers are still looking for a simple alternative that is not only comparably accurate but also applicable to a wide range of tasks. Most existing alternatives focus on suppressing the scale of the residual branch, for example by introducing a learnable scalar initialized to zero at the end of the residual branch. But these methods are either not accurate enough or cannot be used for large-scale training, and their accuracy consistently falls short of EfficientNets.

  The paper therefore builds on previous work on replacing batch normalization and tries to address its core problems. The main contributions are as follows:

  • It proposes Adaptive Gradient Clipping (AGC), which clips gradients unit-wise based on the ratio between the gradient norm and the weight norm. AGC is used to train Normalizer-Free networks with larger batch sizes and stronger data augmentation.
  • It designs the Normalizer-Free ResNet series, named NFNets, which reach SOTA on ImageNet. NFNet-F1 matches the accuracy of EfficientNet-B7 while training 8.7 times faster, and the largest NFNet reaches 86.5% top-1 accuracy.
  • Experiments show that after pre-training on a private dataset of 300 million labelled images and then fine-tuning on ImageNet, accuracy is higher than that of batch-normalized networks; the best model achieves 89.2% top-1 accuracy.

Understanding Batch Normalization

  The paper first reviews several benefits of batch normalization, summarized briefly below:

  • Downscaling the residual branch: batch normalization suppresses the scale of the residual branch, biasing the signal toward direct transmission along the skip path, which helps train very deep networks.
  • Eliminating mean-shift: common activation functions are asymmetric with non-zero mean, so early in training the activations acquire a large positive mean; batch normalization removes this mean-shift.
  • Regularizing effect: because batch normalization computes statistics over the current minibatch, it injects noise that acts as a regularizer, helping prevent overfitting and improving accuracy.
  • Enabling efficient large-batch training: batch normalization smooths the loss landscape, allowing training with larger learning rates and batch sizes.

Towards Removing Batch Normalization


  The research extends the authors' earlier Normalizer-Free ResNets (NF-ResNets), which retain quite good training and test accuracy after removing the normalization layers. The core of NF-ResNets is a residual block of the form $h_{i+1}=h_i+\alpha f_i(h_i/\beta_i)$, where $h_i$ is the input to the $i$-th residual block and $f_i$ is the residual branch of the $i$-th block. $f_i$ is specially initialized to be variance-preserving, i.e., $Var(f_i(z))=Var(z)$. $\alpha=0.2$ controls how fast the variance grows, and $\beta_i=\sqrt{Var(h_i)}$ is the standard deviation of $h_i$. After passing through an NF-ResNet residual block, the output variance becomes $Var(h_{i+1})=Var(h_i)+\alpha^2$.
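
  As a minimal NumPy sketch of this variance bookkeeping (the random linear map standing in for the residual branch $f_i$ is my own stand-in for an actual variance-preserving convolutional branch), the rule $Var(h_{i+1})=Var(h_i)+\alpha^2$ can be checked empirically:

```python
import numpy as np

def nf_residual_block(h, f, alpha=0.2, var_h=1.0):
    """One Normalizer-Free residual block: h_{i+1} = h_i + alpha * f(h_i / beta_i),
    where beta_i = sqrt(Var(h_i)) is tracked analytically and `f` is assumed to be
    variance-preserving at initialization, i.e. Var(f(z)) ~= Var(z)."""
    beta = np.sqrt(var_h)
    out = h + alpha * f(h / beta)
    var_out = var_h + alpha ** 2          # expected variance after the block
    return out, var_out

rng = np.random.default_rng(0)
d = 512
# A random linear map with N(0, 1/d) weights is approximately variance-preserving
# and (unlike an identity branch) roughly uncorrelated with its input.
W = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, d))
branch = lambda z: z @ W

h = rng.normal(size=(4096, d))            # Var(h_0) ~= 1
var_h = 1.0
for i in range(4):
    h, var_h = nf_residual_block(h, branch, alpha=0.2, var_h=var_h)
    print(f"block {i}: analytic Var={var_h:.3f}, empirical Var={h.var():.3f}")
```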

  In addition, the other core component of NF-ResNet is Scaled Weight Standardization, which eliminates the mean-shift caused by the activation functions. The convolution weights are re-parameterized as $\hat{W}_{ij}=\frac{W_{ij}-\mu_i}{\sqrt{N}\,\sigma_i}$:

  where $\mu_i=(1/N)\sum_j W_{ij}$ and $\sigma^2_i=(1/N)\sum_j(W_{ij}-\mu_i)^2$ are the mean and variance of the weights along each row of the convolution kernel (the fan-in, of size $N$). In addition, the output of the nonlinear activation is multiplied by an activation-specific scalar $\gamma$; the combination of the two keeps the variance unchanged.
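
  A minimal NumPy sketch of Scaled Weight Standardization under these definitions (the $\epsilon$ guard and the toy shapes are my own assumptions; the activation gain $\gamma$ is applied separately after the nonlinearity):

```python
import numpy as np

def scaled_weight_standardization(W, eps=1e-6):
    """Standardize each output unit's weights (one row = one fan-in) so that a
    linear/convolution layer preserves variance at initialization.
    W has shape (out_units, fan_in); eps is a small stability constant (an assumption)."""
    fan_in = W.shape[1]
    mu = W.mean(axis=1, keepdims=True)                  # mu_i = (1/N) sum_j W_ij
    var = W.var(axis=1, keepdims=True)                  # sigma_i^2 = (1/N) sum_j (W_ij - mu_i)^2
    return (W - mu) / (np.sqrt(fan_in * var) + eps)     # W_hat_ij = (W_ij - mu_i) / (sqrt(N) * sigma_i)

rng = np.random.default_rng(0)
W_hat = scaled_weight_standardization(rng.normal(size=(128, 3 * 3 * 64)))
x = rng.normal(size=(10000, 3 * 3 * 64))
print(x.var(), (x @ W_hat.T).var())   # output variance stays close to the input variance
```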

  We have previously published a detailed walkthrough of NF-ResNet; interested readers can refer to it.

Adaptive Gradient Clipping for Efficient Large-Batch Training


  Gradient clipping helps training tolerate larger learning rates and accelerates convergence, especially when the loss landscape is poorly conditioned or when training with a large batch size. The paper therefore expects gradient clipping to help NF-ResNet adapt to large-batch training. For the gradient vector $G=\partial L/\partial\theta$, standard gradient clipping is:

$G\rightarrow\begin{cases}\lambda\frac{G}{\|G\|} & \text{if }\|G\|>\lambda\\ G & \text{otherwise}\end{cases}$

  The clipping threshold $\lambda$ is a hyperparameter that needs tuning. Empirically, although gradient clipping allows training with larger batch sizes, the model is very sensitive to the choice of $\lambda$, which depends on the model depth, the batch size and the learning rate. The paper therefore proposes the more convenient Adaptive Gradient Clipping (AGC).
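
  For reference, a minimal sketch of the standard clipping rule above, applied to a flattened gradient vector:

```python
import numpy as np

def clip_by_global_norm(grad, lam):
    """Standard gradient clipping: rescale the gradient to norm lam whenever
    its norm exceeds lam, otherwise leave it unchanged."""
    norm = np.linalg.norm(grad)
    return grad * (lam / norm) if norm > lam else grad

g = np.array([3.0, 4.0])                  # ||g|| = 5
print(clip_by_global_norm(g, lam=1.0))    # rescaled to norm 1 -> [0.6, 0.8]
```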

  Let $W^l\in\mathbb{R}^{N\times M}$ and $G^l\in\mathbb{R}^{N\times M}$ be the weight matrix and gradient matrix of layer $l$, and $\|\cdot\|_F$ the Frobenius norm. The AGC algorithm clips gradients dynamically based on the ratio between the gradient norm and the weight norm, $\frac{\|G^l\|_F}{\|W^l\|_F}$. In practice, clipping unit-wise (per output unit of the convolution kernel, i.e., per row of the weight matrix) works better than clipping the whole kernel at once, so the final AGC rule is:

$G^l_i\rightarrow\begin{cases}\lambda\frac{\|W^l_i\|^{*}_F}{\|G^l_i\|_F}G^l_i & \text{if }\frac{\|G^l_i\|_F}{\|W^l_i\|^{*}_F}>\lambda\\ G^l_i & \text{otherwise}\end{cases}$

  The clipping threshold $\lambda$ is a hyperparameter, and $\|W_i\|^{*}_F=\max(\|W_i\|_F,\epsilon)$ with $\epsilon=10^{-3}$ prevents zero-initialized parameters from always having their gradients clipped to zero. With the help of AGC, NF-ResNets can be trained with a larger batch size (4096) and with stronger data augmentation. The optimal $\lambda$ depends on the optimizer, the learning rate and the batch size; in practice, the larger the batch size, the smaller $\lambda$ should be, e.g., $\lambda=0.01$ for batch size 4096.
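
  A minimal NumPy sketch of unit-wise AGC following the definitions above (the tiny $10^{-12}$ guard against division by zero and the toy tensor shapes are my own additions; in a real training loop this would be applied to every layer's gradient before the optimizer step):

```python
import numpy as np

def adaptive_gradient_clip(W, G, lam=0.01, eps=1e-3):
    """Unit-wise AGC: for each output unit (row) i, clip G_i whenever
    ||G_i||_F / max(||W_i||_F, eps) exceeds lam. W and G have shape (units, fan_in)."""
    w_norm = np.maximum(np.linalg.norm(W, axis=1, keepdims=True), eps)   # ||W_i||_F^*
    g_norm = np.linalg.norm(G, axis=1, keepdims=True)
    clipped = lam * w_norm / np.maximum(g_norm, 1e-12)   # scale that brings the ratio down to lam
    scale = np.where(g_norm / w_norm > lam, clipped, 1.0)
    return G * scale

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.05, size=(256, 1152))
G = rng.normal(0.0, 5.0, size=(256, 1152))                 # an unusually large gradient
G_clipped = adaptive_gradient_clip(W, G, lam=0.01)
ratios = np.linalg.norm(G_clipped, axis=1) / np.linalg.norm(W, axis=1)
print(ratios.max())                                        # <= lam (up to floating point)
```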

  AGC is related to optimizer-side normalization methods such as LARS. LARS fixes the norm of the weight update to a ratio of the weight norm, $\Delta w^l=\gamma\,\eta\,\frac{\|w^l\|}{\|\nabla L(w^l)\|}\nabla L(w^l)$, thus ignoring the magnitude of the gradient and keeping only its direction, which mitigates gradient explosion and vanishing. AGC can be seen as a relaxed version of such normalization: it bounds the maximum gradient based on the weight norm, but it neither constrains the lower bound of the gradient nor discards its magnitude. The paper also tried combining AGC with LARS and found that performance actually dropped.

Normalizer-Free Architectures with Improved Accuracy and Training Speed

  The paper takes an SE-ResNeXt-D model with GELU activations as the base Normalizer-Free network. Besides training with AGC, the main modifications are as follows:

  • The $3\times 3$ convolutions become grouped convolutions with a fixed group width of 128, so the number of groups is determined by the input channel count of the convolution. A smaller group width reduces the theoretical FLOPs, but the lower compute density cannot exploit the dense-compute capability of modern accelerators well, so in practice it does not bring a proportional speed-up.
  • The usual way of deepening ResNets (from ResNet-50 to ResNet-200) concentrates extra blocks in stage 2 and stage 3, while stage 1 and stage 4 keep 3 blocks each. The paper argues this is suboptimal, because both low-level and high-level features need enough capacity to be learned well. It therefore sets the block counts of the smallest network, F0, to $1,2,6,3$ per stage, and larger networks scale these counts in multiples (see the configuration sketch after this list).
  • The ResNet stage widths are $256,512,1024,2048$; after testing, they are changed to $256,512,1536,1536$. Stage 3 is given a larger capacity because it is deep enough to need more capacity for collecting features, while stage 4's capacity is not increased mainly to preserve training speed.
  • The bottleneck residual block of NF-ResNet is applied to SE-ResNeXt and modified by adding an extra $3\times 3$ convolution, which only slightly increases the computation.
  • A scaling strategy is built to produce models for different compute budgets. Since width scaling was found to bring little gain, only depth and input resolution are scaled. As described above, the base network is deepened in multiples, and the resolution is scaled at the same time so that each successive variant trains and evaluates at roughly half the speed of the previous one.
  • As the network grows, the regularization strength is increased. Experiments show that tuning weight decay or the stochastic depth rate (randomly disabling the residual branch of some blocks during training) brings little gain, so regularization is strengthened by increasing the dropout drop rate instead. Since the network lacks the regularization effect of BN, this step is important for preventing overfitting.
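
  Putting these structural choices together, a small configuration sketch (the per-variant depth multiplier and the group-count formula are simplifying assumptions for illustration, not the paper's exact recipe):

```python
# Stage configuration sketch for the NFNet family described above.
F0_DEPTHS = (1, 2, 6, 3)                 # blocks per stage for the smallest network, F0
STAGE_WIDTHS = (256, 512, 1536, 1536)    # output channels per stage
GROUP_WIDTH = 128                        # fixed group width for the 3x3 grouped convolutions

def nfnet_stages(variant):
    """Return (blocks, channels, groups) per stage; variant 0 is F0, 1 is F1, ..."""
    k = variant + 1                      # deeper variants multiply the F0 block counts
    return [(d * k, w, max(w // GROUP_WIDTH, 1)) for d, w in zip(F0_DEPTHS, STAGE_WIDTHS)]

for name, idx in (("F0", 0), ("F1", 1), ("F2", 2)):
    print(name, nfnet_stages(idx))
```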

  With the above modifications, the resulting NFNet series is parameterized as shown in Table 1. The network ends with a global pooling layer, so the training and test resolutions can differ.

Experiment


  Comparison of AGC under different batch sizes, and the relationship between $\lambda$ and batch size.

  Performance comparison of networks of different sizes on ImageNet.

  Based on 10-epoch pre-training, the NF-ResNet variants are fine-tuned on ImageNet; the performance is shown in Table 4.

Conclusion

  The paper argues that Batch Normalization is not an essential part of a network and in fact brings many problems, so it investigates Normalizer-Free networks, aiming for both strong performance and support for large-scale training. The paper proposes the AGC gradient clipping method to assist training, which effectively prevents gradient explosion, and, building on the ideas of NF-ResNet, transforms SE-ResNet into the NFNet series, which can be trained with a huge batch size of 4096 and outperforms the EfficientNet series.

Original site

Copyright notice
This article was created by [VincentLee]; please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/174/202206231334174183.html