
ICLR 2022 | TAdaConv: Dynamic Convolution for Video, and TAdaConvNeXt, an Efficient Convolutional Video Understanding Model

2022-06-23 09:39:00 Zhiyuan community

Paper title

TAda! Temporally-Adaptive Convolutions for Video Understanding

Paper link:

https://arxiv.org/pdf/2110.06178.pdf

Project homepage:

http://tadaconv-iclr2022.github.io

Code:

https://github.com/alibaba-mmai-research/TAdaConv

 

Overview

Compared with images, videos contain more information and usually require more computing resources to process. Efficiently understanding video content has therefore become one of the key research directions in video understanding.

This post introduces TAdaConv, a plug-and-play temporally-adaptive convolution for action recognition models. As an enhanced version of 2D/3D convolution, TAdaConv can significantly improve the performance of SlowFast, R2D, R3D, and any other model that uses 2D/3D convolutions, with negligible extra computation. On the Kinetics-400, Something-Something-V2 and Epic-Kitchens-100 video classification tasks, TAda2D and TAdaConvNeXt, both built on TAdaConv, achieve highly competitive performance.

In addition, as an efficient way to introduce temporal context, TAdaConv has also been applied to tasks beyond video classification. In the CVPR 2022 paper TCTrack: Temporal Contexts for Aerial Tracking, TAdaConv is extended to Online-TAdaConv and shown to extract features with spatio-temporal context in tracking networks, thereby improving tracker performance.

 

Contributions

Since P3D[1] and R(2+1)D[2] were proposed at ICCV 2017 and CVPR 2018 respectively, a large part of subsequent work has handled temporal modeling with 1D convolutions along the temporal axis, whose complexity is \( O(C^2 \times K \times THW) \). This pixel-level operation adds non-negligible computation on top of a pure 2D conv. For example, for 2D and 1D convs with K=3, the 1D conv adds about 33% computation on top of the 2D conv.
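
As a back-of-the-envelope check of this figure (counting only multiply-accumulates, ignoring bias terms and channel reductions, with kernel size \( K=3 \)):

\( \text{FLOPs}_{\text{2D}} \propto C^2 \times K^2 \times THW = 9\,C^2\,THW, \qquad \text{FLOPs}_{\text{1D}} \propto C^2 \times K \times THW = 3\,C^2\,THW \)

so the added temporal convolution costs \( 3/9 \approx 33\% \) of the spatial convolution it is attached to.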

This led us to ask: is there a way to give 2D convolutions temporal modeling ability?

Looking back at 2D convolution, translation invariance (spatial invariance) is one of the important reasons for the great success of convolution [3]. Accordingly, most video models apply 2D convs with weights shared across different positions of different frames, achieving spatiotemporal invariance. However, recent work on images has found that such strict weight sharing is actually too strong an inductive bias, and that appropriately relaxing some of this invariance can give the model better spatial modeling ability [4][5].

We therefore hypothesize that relaxing the temporal invariance of convolution can enhance its temporal modeling ability. Based on this hypothesis, we propose the temporally-adaptive convolution (TAdaConv) to replace the convolutions in traditional video models, and build the efficient video models TAda2D and TAdaConvNeXt on top of ResNet and ConvNeXt.

 

Method

2.1 Formulation of TAdaConv

For spatial convolution, temporal invariance means that the spatial convolution weights are shared across all frames of a video (as shown in (a) of the figure below). To relax this temporal invariance, TAdaConv uses different convolution weights for different frames (as shown in (b) of the figure below).

Concretely, TAdaConv decomposes the convolution kernel \( \mathbf{W}_t \) of each frame into the combination of a base weight and a calibration weight:

\( \mathbf{W}_t = \alpha_t\cdot \mathbf{W}_b \)

where the base weight \( \mathbf{W}_b \) is shared across all frames, while the calibration weight \( \alpha_t \) is specific to each frame and is generated from the input.
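
To make the decomposition concrete, here is a minimal sketch of how a per-frame calibrated convolution could be applied, assuming an input layout of (N, C, T, H, W) and a calibration over the input channels; the function name and the grouped-convolution trick are illustrative, not the official implementation:

import torch
import torch.nn.functional as F

def tada_conv2d(x, base_weight, alpha, padding=1):
    """Apply W_t = alpha_t * W_b to each frame of a video clip.

    x:           input features, shape (N, C_in, T, H, W)  -- assumed layout
    base_weight: shared base kernel W_b, shape (C_out, C_in, k, k)
    alpha:       per-frame calibration, shape (N, T, C_in)  -- calibrating C_in
    """
    n, c_in, t, h, w = x.shape
    c_out, _, k1, k2 = base_weight.shape
    # Per-frame kernels: broadcast alpha over C_out and the k x k window.
    w_t = alpha.reshape(n, t, 1, c_in, 1, 1) * base_weight.view(1, 1, c_out, c_in, k1, k2)
    # Fold (sample, frame) pairs into groups so that each frame is convolved
    # with its own calibrated kernel in a single grouped conv2d call.
    x = x.permute(0, 2, 1, 3, 4).reshape(1, n * t * c_in, h, w)
    w_t = w_t.reshape(n * t * c_out, c_in, k1, k2)
    out = F.conv2d(x, w_t, padding=padding, groups=n * t)
    return out.reshape(n, t, c_out, h, w).permute(0, 2, 1, 3, 4)

Note that only the base weight is stored as a parameter; the per-frame kernels are materialized on the fly from the calibration weights.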

There are three advantages to this design:

  1. TAdaConv is plug-and-play, and the pre-trained weights of the model can still be retained and exploited;
  2. Thanks to the calibration weights, spatial convolution is endowed with temporal reasoning ability;
  3. Compared with temporal convolution, which operates on feature maps, TAdaConv operates on the convolution kernels and is therefore more efficient.

 

2.2 Generation of calibration weights

The next question is how to obtain a different calibration weight \( \alpha_t \) for each frame. We considered several approaches, including learning them directly as free parameters, or generating them dynamically from a global descriptor. In the end we found that, for better temporal modeling, the generation of the calibration weights also needs to take the temporal context into account, i.e. \( \mathbf{x}_t^{\text{adj}} = \{\dots, \mathbf{x}_{t-1}, \mathbf{x}_t, \mathbf{x}_{t+1}, \dots\} \).

In practice, we consider two kinds of temporal context: the local temporal context and the global context. To keep the generation process efficient, the calibration weights are generated from frame descriptors \( \mathbf{v}_t=\text{GAP}_s(\mathbf{x}_t) \) rather than from the full frame features. For the local temporal context, we use two 1D convolutions:

\( \mathcal{F}(\mathbf{x}_t^{\text{adj}})=\text{Conv1D}^{C/r\rightarrow C}(\delta(\text{BN}(\text{Conv1D}^{C\rightarrow C/r}(\mathbf{v}_t^{\text{adj}})))) \)

The global context \( \mathbf{g} \) is added to the frame descriptors through a linear mapping (FC):

\( \mathcal{F}(\mathbf{x}_t^{\text{adj}}, \mathbf{g})=\text{Conv1D}^{C/r\rightarrow C}(\delta(\text{BN}(\text{Conv1D}^{C\rightarrow C/r}(\mathbf{v}_t^{\text{adj}}+\text{FC}^{C\rightarrow C}(\mathbf{g}))))) \)

The generation process is illustrated in (b) of the figure below.
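
Putting the two equations together, here is a hedged sketch of such a generation branch in PyTorch; the class name, the exact placement of BN and the activation, and the (N, C, T, H, W) input layout are assumptions for illustration and may differ from the released RouteFuncMLP:

import torch
import torch.nn as nn

class CalibrationGenerator(nn.Module):
    """Sketch: alpha_t = 1 + Conv1D(ReLU(BN(Conv1D(v_t + FC(g))))), computed on frame descriptors."""
    def __init__(self, c_in, ratio=4, kernel=3):
        super().__init__()
        self.fc_g   = nn.Linear(c_in, c_in)                               # global-context projection
        self.conv_a = nn.Conv1d(c_in, c_in // ratio, kernel, padding=kernel // 2)
        self.bn     = nn.BatchNorm1d(c_in // ratio)
        self.relu   = nn.ReLU(inplace=True)
        self.conv_b = nn.Conv1d(c_in // ratio, c_in, kernel, padding=kernel // 2)
        nn.init.zeros_(self.conv_b.weight)                                # identity at initialization
        nn.init.zeros_(self.conv_b.bias)

    def forward(self, x):                       # x: (N, C, T, H, W), assumed layout
        v = x.mean(dim=(-2, -1))                # frame descriptors v_t via spatial GAP: (N, C, T)
        g = v.mean(dim=-1)                      # global descriptor g via temporal GAP: (N, C)
        v = v + self.fc_g(g).unsqueeze(-1)      # inject global context into every frame descriptor
        a = self.conv_b(self.relu(self.bn(self.conv_a(v))))   # local temporal context via 1D convs
        return (1.0 + a).transpose(1, 2)        # alpha: (N, T, C), all ones at initialization

The zero-initialized last convolution and the added constant 1 implement the identity-preserving initialization described in the next subsection.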

 

2.3 Initialization of calibration weights

Unlike existing dynamic convolution methods, in order to make better use of pre-trained weights, we carefully design the initialization of the calibration weights so that, in its initial state, TAdaConv fully preserves the pre-trained weights. Specifically, when the calibration weight generation function is initialized, the weights of the last 1D convolution are set to all zeros, and a constant 1 is added to guarantee an all-ones output:

\( \alpha_t=\mathcal{G}(\mathbf{x})=\mathbf{1}+\mathcal{F}(\mathbf{x}_t^{\text{adj}}, \mathbf{g}) \)

In this initial state, the dynamic convolution weight \( \mathbf{W}_t=\mathbf{1}\cdot\mathbf{W}_b=\mathbf{W}_b \) is identical to the loaded pre-trained weight \( \mathbf{W}_b \).
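
Under the sketches from the previous subsections (hypothetical class and shapes), this property can be checked directly: at initialization the calibration is exactly one, so the calibrated kernel equals the base kernel.

import torch

gen = CalibrationGenerator(c_in=64)                   # sketch from Section 2.2
x = torch.randn(2, 64, 8, 56, 56)                     # (N, C, T, H, W), assumed layout
alpha = gen(x)                                        # (N, T, C)
assert torch.allclose(alpha, torch.ones_like(alpha))  # alpha_t == 1 at initialization

w_b = torch.randn(64, 64, 3, 3)                       # a pre-trained 2D kernel W_b
w_t = alpha.reshape(2, 8, 1, 64, 1, 1) * w_b          # W_t = alpha_t * W_b
assert torch.allclose(w_t, w_b.expand_as(w_t))        # identical to the pre-trained weight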

Compared with (2+1)D convolution, TAdaConv has clear advantages in computation and parameter count, both at the operator level and at the model level.

 

2.4 Calibration dimension

For a base weight \( \mathbf{W}_b \), the dimension is usually \( C_{\text{out}}\times C_{\text{in}} \times k^2 \), so there are three dimensions along which calibration can be applied. We set the calibration dimension to \( C_{\text{in}} \).
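
As a shape-level illustration of the three options (a sketch; the tensor shapes are assumptions consistent with the earlier sketches, not code from the repository), calibrating a different dimension just changes which axis of the base weight the per-frame factor is broadcast over:

import torch

w_b = torch.randn(64, 32, 3, 3)            # W_b: (C_out, C_in, k, k)
t = 8                                      # number of frames

alpha_cin  = torch.rand(t, 1, 32, 1, 1)    # calibrate C_in  (the dimension TAdaConv uses)
alpha_cout = torch.rand(t, 64, 1, 1, 1)    # calibrate C_out
alpha_kk   = torch.rand(t, 1, 1, 3, 3)     # calibrate the k x k spatial window

w_t = alpha_cin * w_b                      # (T, C_out, C_in, k, k): one kernel per frame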

 

03. Using TAdaConv in video models

For the full implementation, please refer to the TAdaConv repo.

# 1. copy models/module_zoo/ops/tadaconv.py somewhere in your project
#    and import TAdaConv2d, RouteFuncMLP
import torch.nn as nn
from tadaconv import TAdaConv2d, RouteFuncMLP

class Model(nn.Module):
  def __init__(self):
    ...
    # 2. define tadaconv and the route func in your model
    self.conv_rf = RouteFuncMLP(
                c_in=64,            # number of input filters
                ratio=4,            # reduction ratio for MLP
                kernels=[3,3],      # list of temporal kernel sizes
    )
    self.conv = TAdaConv2d(
                in_channels     = 64,
                out_channels    = 64,
                kernel_size     = [1, 3, 3], # usually the temporal kernel size is fixed to be 1
                stride          = [1, 1, 1], # usually the temporal stride is fixed to be 1
                padding         = [0, 1, 1], # usually the temporal padding is fixed to be 0
                bias            = False,
                cal_dim         = "cin"
            )
    ...

  def forward(self, x):
    ...
    # 3. replace 'x = self.conv(x)' with the following line
    x = self.conv(x, self.conv_rf(x))
    ...
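
For a quick smoke test, the two modules can also be instantiated directly; a minimal sketch, assuming the repository is on the Python path and that the expected input layout is (N, C, T, H, W):

import torch
from tadaconv import TAdaConv2d, RouteFuncMLP

conv_rf = RouteFuncMLP(c_in=64, ratio=4, kernels=[3, 3])
conv = TAdaConv2d(in_channels=64, out_channels=64,
                  kernel_size=[1, 3, 3], stride=[1, 1, 1],
                  padding=[0, 1, 1], bias=False, cal_dim="cin")

x = torch.randn(2, 64, 8, 56, 56)   # a dummy clip: (N, C, T, H, W), assumed layout
y = conv(x, conv_rf(x))             # calibration weights generated by the route function
print(y.shape)                      # expected to match the input: torch.Size([2, 64, 8, 56, 56])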

 

04. The TAda2D & TAdaConvNeXt networks

Based on ConvNeXt[6] and ResNet, we build the TAdaConvNeXt and TAda2D video models respectively. For ConvNeXt, we directly replace the 2D depthwise convolutions in ConvNeXt with depthwise TAdaConv, and follow most existing transformer-based video models in using 3D tokenization. For ResNet, since ResNet-based video models usually use a 2D stem, we additionally propose a temporal information aggregation scheme based on average pooling, placed after TAdaConv:

\( \mathbf{x}_\text{aggr}=\delta(\text{BN}_1(\mathbf{\tilde{x}})+\text{BN}_2(\text{TempAvgPool}_k(\mathbf{\tilde{x}}))) \)
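
A hedged sketch of this aggregation step (the module name, the temporal pooling size, the choice of ReLU for \( \delta \), and the (N, C, T, H, W) layout are assumptions; the released TAda2D block may differ in details):

import torch
import torch.nn as nn

class TemporalAggregation(nn.Module):
    """x_aggr = ReLU(BN1(x) + BN2(TempAvgPool_k(x))), applied after TAdaConv."""
    def __init__(self, channels, pool_k=3):
        super().__init__()
        self.bn1 = nn.BatchNorm3d(channels)
        self.bn2 = nn.BatchNorm3d(channels)
        # average pooling along the temporal axis only, keeping H and W untouched
        self.pool = nn.AvgPool3d(kernel_size=(pool_k, 1, 1),
                                 stride=1, padding=(pool_k // 2, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):            # x: (N, C, T, H, W)
        return self.relu(self.bn1(x) + self.bn2(self.pool(x)))

Note that the two branches use separate BatchNorm layers; the ablation in Section 5.5 reports that this separation matters.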

 

Experiments

In the experiments, we evaluate TAdaConv, TAda2D and TAdaConvNeXt on two tasks across a total of five datasets.

5.1 Hypothesis testing

We first verify that relaxing temporal invariance improves temporal modeling ability. We compare three ways of generating calibration weights: learnable calibration, dynamic calibration, and our TAda. For the learnable and dynamic variants, we further distinguish whether they are temporally varying (T.V.), i.e. whether the convolution weights differ across frames (which relaxes temporal invariance). From the table below we can draw the following conclusions:

  1. Learnable calibration improves over the baseline without calibration weights to some extent (2.3%), and when temporal invariance is relaxed the improvement reaches 13.4%;
  2. Dynamic calibration, even under the strict temporal invariance constraint, brings a larger improvement over the baseline (19.2%), and relaxing temporal invariance improves performance further (21.8%);
  3. TAdaConv, whose calibration weight generation combines local and global temporal context, performs best (59.2%).

This supports our hypothesis that appropriately relaxing temporal invariance benefits temporal modeling.

5.2 Plug-in experiments

Plugging TAdaConv into mainstream convolutional video networks, including R(2+1)D[2], R3D[7] and SlowFast[8], yields considerable gains: on average about 1.3% on Kinetics-400, about 2.8% on Something-Something-V2, and more than 1% on Epic-Kitchens, while the extra computation is only 0.01~0.02 GFLOPs.

5.3 Action recognition results

We evaluate action recognition mainly on Kinetics-400, Something-Something-V2 and Epic-Kitchens-100. TAda2D achieves a better performance/computation trade-off than existing convolutional video models, while TAdaConvNeXt is highly competitive with recent transformer-based video models.

5.4 Temporal action localization results

For temporal action localization, we evaluate on HACS and Epic-Kitchens-100. Compared with the TSN baseline, the features provided by TAda2D bring a 5% mAP improvement on both datasets.

5.5 Ablation experiments

We ablate the calibration weight generation process of TAda (Table 4). We find that taking the temporal context into account during calibration weight generation is the key to TAdaConv's performance gain. On top of the baseline, generating calibration weights from the local context with two 1D convs (Non-Lin.) brings a 25.8% gain, and considering only the global context brings a 17.6% gain (54.4 vs. 36.8). Combining the two brings a 27.2% gain over the baseline.

When we further add temporal information aggregation, we find that using a separate BatchNorm for the average-pooled features, rather than sharing the same BatchNorm, brings an 8% performance gain.

We also try replacing 2D convs with TAdaConv in different stages and for different fractions of channels (left figure). We find that using only 1/64 of the channels is already enough to greatly improve the temporal modeling ability of the network, and that replacing 2D convs with TAdaConv in deeper stages of the network brings larger gains. Compared with existing video models (right figure), TAda2D and TAdaConvNeXt achieve the best performance/computation trade-off.

For the calibration dimension, we find that calibrating the input channels gives the largest improvement, but as long as some dimension is calibrated, performance is similar.

 

References

[1] Learning spatio-temporal representation with pseudo-3d residual networks, in ICCV 2017.

[2] A closer look at spatiotemporal convolutions for action recognition, in CVPR 2018.

[3] Natural image statistics and neural representation, in Physical review letters 1994.

[4] Revisiting spatial invariance with low-rank local connectivity, in ICML 2020.

[5] Decoupled Dynamic Filter Networks, in CVPR 2021.

[6] A ConvNet for the 2020s, in arXiv 2022.

[7] Learning spatiotemporal features with 3d convolutional networks, in ICCV 2015.

[8] Slowfast networks for video recognition, in ICCV 2019.
