
Filling the pits left by RepVGG? It is actually the open-source release of RepOptimizer, a.k.a. RepVGG2

2022-06-23 09:39:00 Zhiyuan community

When designing neural network architectures, we often introduce prior knowledge, such as the residual structure of ResNet. Nevertheless, we still train such networks with conventional optimizers. In this work, we propose using prior information to modify the gradient values instead, which we call gradient re-parameterization; the corresponding optimizer is called RepOptimizer. We focus on the plain, VGG-style straight-pipe model and train it into the RepOptVGG model, which offers high training efficiency, a simple and direct structure, and extremely fast inference speed.

Paper link: https://arxiv.org/abs/2205.15242

Official repository: https://github.com/DingXiaoH/RepOptimizers

Differences from RepVGG

  1. RepVGG adds structural priors (such as the 1x1 and identity branches) and trains them with a regular optimizer, whereas RepOptVGG builds this prior knowledge directly into the optimizer.
  2. Although RepVGG can fuse its branches at inference time and become a straight-pipe model, it has multiple branches during training, which costs extra memory and training time. RepOptVGG, by contrast, is a truly straight-pipe model: it has a plain VGG structure even during training (see the sketch after this list).
  3. We realize the equivalent transformation between structural re-parameterization and gradient re-parameterization through a customized optimizer. The transformation is general and can be extended to more models.
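
To make the contrast in point 2 concrete, here is a minimal PyTorch sketch (illustrative only, not the official implementation; the module names and the simplified block structure, e.g. whether a BN follows the single conv, are assumptions):

import torch
import torch.nn as nn

class RepVGGTrainingBlock(nn.Module):
    # Simplified RepVGG-style training-time block: 3x3, 1x1 and identity branches,
    # each with its own BN, summed before the nonlinearity.
    def __init__(self, channels):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3x3 = nn.BatchNorm2d(channels)
        self.conv1x1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1x1 = nn.BatchNorm2d(channels)
        self.bn_id = nn.BatchNorm2d(channels)  # identity branch exists only when in_ch == out_ch
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn3x3(self.conv3x3(x)) + self.bn1x1(self.conv1x1(x)) + self.bn_id(x))

class RepOptVGGBlock(nn.Module):
    # RepOptVGG-style block: a single 3x3 conv + ReLU in both training and inference;
    # the prior knowledge of the extra branches lives in the optimizer instead.
    def __init__(self, channels):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv3x3(x))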

Introducing structural prior knowledge into the optimizer

We noticed a phenomenon: in a special case where each branch contains one linearly trainable parameter scaled by a constant value, the model still performs very well as long as the scaling values are set reasonably. We call such a network block a Constant-Scale Linear Addition (CSLA) block. Let us start with a simple CSLA example: consider an input that passes through 2 convolution branches, each followed by a linear scaling, whose outputs are added into a single output. We want to transform this equivalently into a single branch; the equivalence corresponds to 2 rules:

Initialization rule

The fused weight should be initialized as

W' = α1 · W1 + α2 · W2

where W1, W2 are the weights of the two branches and α1, α2 are their constant scaling values.

Update rule

For the fused weight, the update rule is

W' ← W' − λ · (α1² + α2²) · ∂L/∂W'

that is, the gradient of the fused kernel is additionally multiplied by (α1² + α2²), where λ is the learning rate.

For the detailed derivation of these formulas, please refer to Appendix A of the paper. A simple example code is:

# --- Two-branch CSLA: two scaled conv branches, each updated by plain SGD ---
import torch
import numpy as np

np.random.seed(0)
np_x = np.random.randn(1, 1, 5, 5).astype(np.float32)
np_w1 = np.random.randn(1, 1, 3, 3).astype(np.float32)
np_w2 = np.random.randn(1, 1, 3, 3).astype(np.float32)
alpha1 = 1.0
alpha2 = 1.0
lr = 0.1

conv1 = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
conv2 = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
conv1.weight.data = torch.nn.Parameter(torch.tensor(np_w1))
conv2.weight.data = torch.nn.Parameter(torch.tensor(np_w2))

torch_x = torch.tensor(np_x, requires_grad=True)
out = alpha1 * conv1(torch_x) + alpha2 * conv2(torch_x)
loss = out.sum()
loss.backward()

# Update each branch separately, then fuse (alpha1 = alpha2 = 1 here).
torch_w1_updated = conv1.weight.detach().numpy() - conv1.weight.grad.numpy() * lr
torch_w2_updated = conv2.weight.detach().numpy() - conv2.weight.grad.numpy() * lr
print(torch_w1_updated + torch_w2_updated)

# --- Equivalent single fused branch updated with the re-parameterized gradient ---
import torch
import numpy as np

np.random.seed(0)
np_x = np.random.randn(1, 1, 5, 5).astype(np.float32)
np_w1 = np.random.randn(1, 1, 3, 3).astype(np.float32)
np_w2 = np.random.randn(1, 1, 3, 3).astype(np.float32)
alpha1 = 1.0
alpha2 = 1.0
lr = 0.1

fused_conv = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
# Initialization rule: W' = alpha1 * W1 + alpha2 * W2
fused_conv.weight.data = torch.nn.Parameter(torch.tensor(alpha1 * np_w1 + alpha2 * np_w2))

torch_x = torch.tensor(np_x, requires_grad=True)
out = fused_conv(torch_x)
loss = out.sum()
loss.backward()

# Update rule: the fused gradient is multiplied by (alpha1^2 + alpha2^2)
torch_fused_w_updated = fused_conv.weight.detach().numpy() - (alpha1**2 + alpha2**2) * fused_conv.weight.grad.numpy() * lr
print(torch_fused_w_updated)
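
Both snippets print the same matrix (up to floating-point error): updating the two scaled branches separately and then summing them gives the same result as updating the fused kernel once with the gradient multiplied by (α1² + α2²). With α1 = α2 = 1 as set above, the fused weight is simply W1 + W2.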

In RepOptVGG, the corresponding CSLA block is obtained by replacing the 3x3 conv, 1x1 conv and BN layers of a RepVGG block with a 3x3 conv and a 1x1 conv, each carrying a learnable scaling parameter. Extending the derivation above to multiple branches, and writing s_c and t_c for the scaling coefficients of the 3x3 conv and the 1x1 conv on output channel c, the corresponding update rule multiplies the gradient of the fused 3x3 kernel entry-wise by (roughly; see the paper for the exact form):

1 + s_c² + t_c²   (identity, conv3x3 and conv1x1 all contribute)
s_c² + t_c²       (only conv3x3 and conv1x1 contribute)
s_c²              (otherwise)

The first formula corresponds to input channels == output channels, where all 3 branches exist, namely identity, conv3x3 and conv1x1; the second formula corresponds to input channels != output channels, where only the conv3x3 and conv1x1 branches exist; the third formula corresponds to the remaining cases. Note that the CSLA block must not contain BN, which is a training-time nonlinearity, nor trainable parameters connected in series (sequential trainable parameters).
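
The following is a minimal sketch (not the official RepOptimizers code; build_grad_mult and the exact mask construction here are illustrative assumptions) of how such a multiplier could be built from the scales s and t and applied to the gradient of the fused 3x3 kernel before an ordinary SGD step:

import torch

def build_grad_mult(out_ch, in_ch, s, t):
    # s, t: per-output-channel scales of the conv3x3 and conv1x1 branches, shape (out_ch,)
    # Start from s^2: only the 3x3 branch contributes to non-central kernel positions.
    mask = (s ** 2).view(-1, 1, 1, 1).expand(out_ch, in_ch, 3, 3).clone()
    # Central position: the 1x1 branch also contributes.
    mask[:, :, 1, 1] = (s ** 2 + t ** 2).view(-1, 1)
    # Diagonal central entries when in_ch == out_ch: the identity branch contributes as well.
    if in_ch == out_ch:
        idx = torch.arange(out_ch)
        mask[idx, idx, 1, 1] += 1.0
    return mask

# Toy usage: scale a conv kernel's gradient by the multiplier before a plain SGD step.
conv = torch.nn.Conv2d(4, 4, 3, padding=1, bias=False)
s = torch.rand(4)
t = torch.rand(4)
grad_mult = build_grad_mult(4, 4, s, t)

x = torch.randn(1, 4, 8, 8)
loss = conv(x).sum()
loss.backward()
with torch.no_grad():
    conv.weight -= 0.1 * grad_mult * conv.weight.grad  # lr = 0.1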


Copyright notice

This article was created by [Zhiyuan community]. Please include a link to the original when reposting.
https://yzsam.com/2022/174/202206230932219409.html