Filling the pits of RepVGG? RepOptimizer, in effect RepVGG 2, is now open source
2022-06-23 09:39:00 【Zhiyuan community】
When designing neural network architectures, we often build in prior knowledge, such as the residual structure of ResNet, yet we still train the network with a conventional optimizer. In this work, we propose to encode such prior information into the gradient values instead. We call this gradient re-parameterization, and the corresponding optimizer a RepOptimizer. Focusing on the plain, VGG-style architecture, we train RepOptVGG models that combine high training efficiency, a simple straight-through structure, and very fast inference.

Paper: https://arxiv.org/abs/2205.15242
Official repository: https://github.com/DingXiaoH/RepOptimizers
Differences from RepVGG
- RepVGG adds structural priors (such as the 1x1 and identity branches) and still trains them with a regular optimizer, whereas RepOptVGG moves this prior knowledge into the optimizer implementation.
- Although RepVGG can fuse its branches at inference time and become a plain model, it keeps multiple branches during training, which costs extra memory and training time. RepOptVGG, in contrast, is a genuinely plain model: it is a VGG-style structure throughout training (see the sketch below).
- We achieve this by customizing the optimizer, establishing an equivalent transformation between structural re-parameterization and gradient re-parameterization. The transformation is general and can be extended to more models.
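
To make the structural difference concrete, here is a minimal PyTorch sketch contrasting a RepVGG-style multi-branch training block with the plain training block that RepOptVGG uses. The class names (`RepVGGTrainBlock`, `PlainTrainBlock`) are illustrative only and do not come from the official repository.

```python
import torch
import torch.nn as nn

class RepVGGTrainBlock(nn.Module):
    """RepVGG-style training block: three parallel branches that are only
    fused into a single 3x3 conv after training (structural re-param)."""
    def __init__(self, channels):
        super().__init__()
        self.conv3x3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.conv1x1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.identity = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv3x3(x) + self.conv1x1(x) + self.identity(x))

class PlainTrainBlock(nn.Module):
    """RepOptVGG-style training block: already a plain 3x3 conv; the prior
    knowledge about the extra branches lives in the optimizer, not the graph."""
    def __init__(self, channels):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, 3, padding=1, bias=True)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv3x3(x))

x = torch.randn(1, 16, 32, 32)
print(RepVGGTrainBlock(16)(x).shape, PlainTrainBlock(16)(x).shape)
```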

Introducing structural prior knowledge into the optimizer
We noticed that, in the special case where each branch contains only a linear trainable operator followed by a constant scaling value, the model still performs well as long as the scaling values are set reasonably. We call such a block Constant-Scale Linear Addition (CSLA). Let us start with a simple CSLA example: an input passes through two convolution branches, each followed by a constant linear scaling, and the results are added to form the output. We want to transform this equivalently into a single branch, and the equivalent transformation corresponds to two rules:
Initialization rule
The fused weight should be initialized as

$$W'^{(0)} = \alpha_1 W_1^{(0)} + \alpha_2 W_2^{(0)}$$

where $\alpha_1$ and $\alpha_2$ are the constant scaling values of the two branches.
Update rule
For the fused weight, the update rule is

$$W'^{(t+1)} = W'^{(t)} - \lambda\,(\alpha_1^2 + \alpha_2^2)\,\frac{\partial L}{\partial W'^{(t)}}$$

where $\lambda$ is the learning rate.
A detailed derivation of these formulas is given in Appendix A of the paper. A simple example:
```python
# Multi-branch form: two scaled conv branches, each updated independently.
import torch
import numpy as np

np.random.seed(0)
np_x = np.random.randn(1, 1, 5, 5).astype(np.float32)
np_w1 = np.random.randn(1, 1, 3, 3).astype(np.float32)
np_w2 = np.random.randn(1, 1, 3, 3).astype(np.float32)
alpha1 = 1.0
alpha2 = 1.0
lr = 0.1

conv1 = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
conv2 = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
conv1.weight.data = torch.nn.Parameter(torch.tensor(np_w1))
conv2.weight.data = torch.nn.Parameter(torch.tensor(np_w2))

torch_x = torch.tensor(np_x, requires_grad=True)
out = alpha1 * conv1(torch_x) + alpha2 * conv2(torch_x)
loss = out.sum()
loss.backward()

# One SGD step per branch, then combine the updated branch weights.
torch_w1_updated = conv1.weight.detach().numpy() - conv1.weight.grad.numpy() * lr
torch_w2_updated = conv2.weight.detach().numpy() - conv2.weight.grad.numpy() * lr
print(torch_w1_updated + torch_w2_updated)
```

```python
# Fused form: a single conv initialized and updated with the two CSLA rules.
import torch
import numpy as np

np.random.seed(0)
np_x = np.random.randn(1, 1, 5, 5).astype(np.float32)
np_w1 = np.random.randn(1, 1, 3, 3).astype(np.float32)
np_w2 = np.random.randn(1, 1, 3, 3).astype(np.float32)
alpha1 = 1.0
alpha2 = 1.0
lr = 0.1

# Initialization rule: W' = alpha1 * W1 + alpha2 * W2
fused_conv = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
fused_conv.weight.data = torch.nn.Parameter(torch.tensor(alpha1 * np_w1 + alpha2 * np_w2))

torch_x = torch.tensor(np_x, requires_grad=True)
out = fused_conv(torch_x)
loss = out.sum()
loss.backward()

# Update rule: scale the gradient by (alpha1^2 + alpha2^2); the result matches
# the combined weights printed by the multi-branch script above.
torch_fused_w_updated = fused_conv.weight.detach().numpy() - (alpha1**2 + alpha2**2) * fused_conv.weight.grad.numpy() * lr
print(torch_fused_w_updated)
```

In RepOptVGG, the corresponding CSLA block is the RepVGG block with its 3x3 convolution, 1x1 convolution, and BN layers replaced by a 3x3 convolution and a 1x1 convolution, each with a learnable scaling parameter. Extending further to the multi-branch case, suppose s and t are the scaling coefficients of the 3x3 convolution and the 1x1 convolution respectively; then the corresponding update rule is:

$$
W' \leftarrow W' - \lambda\, M \odot \frac{\partial L}{\partial W'},\qquad
M_{c,d,p,q} =
\begin{cases}
1 + s_c^2 + t_c^2 \\
s_c^2 + t_c^2 \\
s_c^2
\end{cases}
$$

The first formula corresponds to the case where the number of input channels equals the number of output channels, so there are three branches in total (identity, conv3x3 and conv1x1); the second formula corresponds to the case where the input and output channel counts differ, leaving only the conv3x3 and conv1x1 branches; the third formula covers all other kernel positions, which only the 3x3 convolution reaches. Note that a CSLA block must contain no training-time nonlinearity such as BN, and no non-sequential trainable parameters.
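
Below is a minimal sketch, under stated assumptions rather than the official RepOptimizers implementation, of how such an element-wise gradient multiplier ("Grad Mult") for the fused 3x3 kernel could be constructed and applied in a plain SGD step. The helper `build_grad_mult` and the way the per-channel scales `s` and `t` are obtained here (random placeholders) are assumptions for illustration.

```python
import torch

def build_grad_mult(weight: torch.Tensor, s: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Element-wise gradient multiplier for a fused 3x3 kernel of shape
    (C_out, C_in, 3, 3), following the three cases described above.
    NOTE: an illustrative sketch, not the official implementation."""
    c_out, c_in, _, _ = weight.shape
    # Off-center positions are reached only by the 3x3 branch: multiplier s^2.
    mult = (s ** 2).view(-1, 1, 1, 1).expand(c_out, c_in, 3, 3).clone()
    # The central point is also reached by the fused 1x1 branch: s^2 + t^2.
    mult[:, :, 1, 1] = (s ** 2 + t ** 2).view(-1, 1)
    # With matching channel counts, the identity branch adds 1 at the
    # channel-matched central positions: 1 + s^2 + t^2.
    if c_in == c_out:
        idx = torch.arange(c_out)
        mult[idx, idx, 1, 1] += 1.0
    return mult

# Toy usage: one manual SGD step with the multiplied gradient.
conv = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1, bias=False)
s = torch.rand(4)  # per-channel scale of the 3x3 branch (placeholder values)
t = torch.rand(4)  # per-channel scale of the 1x1 branch (placeholder values)
lr = 0.1

x = torch.randn(1, 4, 8, 8)
loss = conv(x).sum()
loss.backward()

with torch.no_grad():
    grad_mult = build_grad_mult(conv.weight, s, t)
    conv.weight -= lr * grad_mult * conv.weight.grad
```

Because the scales are constants during training, such an optimizer can precompute one Grad Mult per layer and reuse it at every step.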