ICLR 2022 | TAdaConv: Dynamic Convolution in Video, and TAdaConvNeXt: an Efficient Convolutional Video Understanding Model
2022-06-23 09:39:00 [Zhiyuan Community]

Paper title:
TAda! Temporally-Adaptive Convolutions for Video Understanding
Paper link:
https://arxiv.org/pdf/2110.06178.pdf
Project homepage:
http://tadaconv-iclr2022.github.io
Code:
https://github.com/alibaba-mmai-research/TAdaConv
Overview
Compared with images, videos contain more information and usually require more computing resources to process. How to understand video content efficiently has therefore become one of the key research directions in video understanding.
Today we introduce TAdaConv, a plug-and-play temporally-adaptive convolution for action recognition models. As an enhanced version of 2D/3D convolution, TAdaConv can significantly improve the performance of any model that uses 2D/3D convolutions, such as SlowFast, R2D, and R3D, with negligible extra computation. On the Kinetics-400, Something-Something-V2, and Epic-Kitchens-100 video classification tasks, the TAda2D and TAdaConvNeXt models built on TAdaConv achieve highly competitive performance.
In addition, as an efficient way of introducing temporal context, TAdaConv has also been applied to tasks beyond video classification. In the CVPR 2022 paper TCTrack: Temporal Contexts for Aerial Tracking, TAdaConv is extended to Online-TAdaConv, which is shown to extract features with spatio-temporal context in object tracking networks and thereby improve tracker performance.

Contributions
Since P3D [1] and R(2+1)D [2] were proposed at ICCV 2017 and CVPR 2018 respectively, a large fraction of temporal modeling in video understanding has been done with 1D convolutions along the temporal axis, whose complexity is \( O(C^2 \times K \times THW) \). This pixel-level operation adds non-negligible computation on top of pure 2D convolutions: for kernel size K=3, a 1D conv increases the computation of a 2D conv by about 33%.
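To see where this figure comes from: a \( K\times K \) 2D conv costs on the order of \( C^2 \times K^2 \times THW \) multiply-adds, i.e., \( 9\,C^2THW \) for \( K=3 \), while the added temporal 1D conv costs \( C^2 \times K \times THW = 3\,C^2THW \), an overhead of \( 3/9 \approx 33\% \).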
This raises the question: is there a way to give 2D convolutions temporal modeling ability?
Looking back at 2D convolutions, translation invariance (or spatial invariance) is one of the important reasons for the great success of convolution [3]. Accordingly, most video models apply 2D convolutions with weights shared across different positions of different frames, achieving spatio-temporal invariance. However, recent work on images has found that such strict weight sharing is actually too strong an inductive bias, and that appropriately relaxing part of this invariance gives the model better spatial modeling ability [4][5].

We therefore hypothesize that relaxing the temporal invariance of convolutions can enhance their temporal modeling ability. Based on this hypothesis, we propose the temporally-adaptive convolution (TAdaConv) as a replacement for the convolutions in traditional video models, and build the efficient video models TAda2D and TAdaConvNeXt on top of ResNet and ConvNeXt.
Method
For a spatial convolution, temporal invariance means that the spatial convolution weights are shared across all frames of the video (as shown in (a) below). To relax temporal invariance, TAdaConv therefore applies different convolution weights to different frames (as shown in (b) below).

Concretely, TAdaConv factorizes the convolution kernel \( \mathbf{W}_t \) of each frame into the combination of a base weight and a calibration weight:
\( \mathbf{W}_t = \alpha_t\cdot \mathbf{W}_b \)
where the base weight \( \mathbf{W}_b \) is shared across frames, while the calibration weight \( \alpha_t \) is specific to each frame and generated from the input.
This factorization (sketched in code below) has three advantages:
- TAdaConv is plug-and-play, and the pretrained weights of the model can still be retained and exploited;
- thanks to the calibration weights, spatial convolutions are endowed with temporal reasoning ability;
- temporal convolutions operate on feature maps, whereas TAdaConv operates on convolution kernels, so TAdaConv is more efficient.
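To make the factorization concrete, here is a minimal PyTorch sketch. The function name and the naive per-frame loop are our own illustration; the official TAdaConv2d implements the same computation far more efficiently (e.g., by folding frames into the batch dimension):

```python
import torch
import torch.nn.functional as F

def tada_conv2d_naive(x, w_base, alpha):
    """Naive reference sketch of W_t = alpha_t * W_b (not the optimized repo code).

    x:      (T, C_in, H, W)     one video clip, one row per frame
    w_base: (C_out, C_in, k, k) base weight shared across frames
    alpha:  (T, C_in)           per-frame calibration along C_in (see Sec. 2.4)
    """
    k = w_base.shape[-1]
    out = []
    for t in range(x.shape[0]):
        w_t = w_base * alpha[t].view(1, -1, 1, 1)   # calibrate the shared kernel
        out.append(F.conv2d(x[t].unsqueeze(0), w_t, padding=k // 2))
    return torch.cat(out, dim=0)                    # (T, C_out, H, W)
```

Note that what varies with \( t \) is the kernel, not an extra operation on the feature map; this is why the overhead stays small compared with temporal convolutions on features.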
2.2 Generation of the calibration weights
The next question is how to obtain a different calibration weight \( \alpha_t \) for each frame. We considered several approaches, including using learnable parameters or generating the calibration weights from a global descriptor. In the end we found that, for better temporal modeling, the generation of the calibration weights also needs to take the temporal context into account, i.e., \( \mathbf{x}_t^{\text{adj}}=\{\,...,\mathbf{x}_{t-1},\mathbf{x}_t,\mathbf{x}_{t+1},...\,\} \).
In the actual generation of the calibration weights, we consider two kinds of temporal context: the local temporal context and the global context. To keep the generation process efficient, the calibration weights are generated from frame descriptors \( \mathbf{v}_t=\text{GAP}_s(\mathbf{x}_t) \) rather than from the frame features themselves. For the local temporal context, we use two stacked 1D convolutions:
\( \mathcal{F}(\mathbf{x}_t^{\text{adj}})=\text{Conv1D}^{C/r\rightarrow C}(\delta(\text{BN}(\text{Conv1D}^{C\rightarrow C/r}(\mathbf{v}_t^{\text{adj}})))) \)
The global context \( \mathbf{g} \) is superimposed on the frame descriptors through a linear mapping (FC):
\( \mathcal{F}(\mathbf{x}_t^{\text{adj}}, \mathbf{g})=\text{Conv1D}^{C/r\rightarrow C}(\delta(\text{BN}(\text{Conv1D}^{C\rightarrow C/r}(\mathbf{v}_t^{\text{adj}}+\text{FC}^{C\rightarrow C}(\mathbf{g}))))) \)
The generation process is illustrated in (b) of the figure below.

2.3 Initialization of the calibration weights
Unlike existing dynamic convolution approaches, in order to make better use of pretrained weights, we carefully design the initialization of TAdaConv's calibration weights so that, in the initial state, TAdaConv fully preserves the pretrained weights. Specifically, when the calibration weight generation function is initialized, the weights of the last 1D convolution are initialized to all zeros, and a constant 1 is added to guarantee an all-ones output:
\( \alpha_t=\mathcal{G}(\mathbf{x})=\mathbf{1}+\mathcal{F}(\mathbf{x}_t^{\text{adj}}, \mathbf{g}) \)
In this initial state, the dynamic convolution weight \( \mathbf{W}_t=\mathbf{1}\cdot\mathbf{W}_b=\mathbf{W}_b \) is identical to the loaded pretrained weight \( \mathbf{W}_b \).
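Putting the generation (Sec. 2.2) and the initialization (Sec. 2.3) together, a minimal PyTorch sketch of the calibration-weight generator could look as follows. The class name and exact layer hyperparameters here are our own simplification; the corresponding module in the official repo is RouteFuncMLP:

```python
import torch
import torch.nn as nn

class CalibrationGenerator(nn.Module):
    """Simplified sketch of the calibration-weight generator (cf. RouteFuncMLP)."""
    def __init__(self, c_in, ratio=4, k=3):
        super().__init__()
        self.fc_g = nn.Linear(c_in, c_in)   # FC that maps the global descriptor g
        self.conv_a = nn.Conv1d(c_in, c_in // ratio, k, padding=k // 2)
        self.bn = nn.BatchNorm1d(c_in // ratio)
        self.relu = nn.ReLU(inplace=True)
        self.conv_b = nn.Conv1d(c_in // ratio, c_in, k, padding=k // 2)
        nn.init.zeros_(self.conv_b.weight)  # zero-init the last conv (Sec. 2.3)
        nn.init.zeros_(self.conv_b.bias)    # so that alpha_t = 1 at initialization

    def forward(self, x):                   # x: (N, C, T, H, W)
        v = x.mean(dim=(-2, -1))            # frame descriptors via spatial GAP: (N, C, T)
        g = v.mean(dim=-1)                  # global descriptor: (N, C)
        v = v + self.fc_g(g).unsqueeze(-1)  # add mapped global context to each frame
        f = self.conv_b(self.relu(self.bn(self.conv_a(v))))
        return 1 + f                        # (N, C, T): one calibration vector per frame
```

With the zero initialization, the generator outputs all ones at the start of training, so any pretrained 2D weights are reproduced exactly.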
Compared with (2+1)D convolutions, TAdaConv has clear advantages in both computation and parameter count, at the operator level as well as the model level:

2.4 Calibration dimension
For a base weight \( \mathbf{W}_b \), the dimensions are usually \( C_{\text{out}}\times C_{\text{in}} \times k^2 \), so there are three candidate dimensions along which to calibrate. We set the calibration dimension to \( C_{\text{in}} \).
03. Using TAdaConv in a video model
For the specific code implementation, please refer to the TAdaConv repo.
```python
import torch.nn as nn

# 1. Copy models/module_zoo/ops/tadaconv.py somewhere in your project
#    and import TAdaConv2d, RouteFuncMLP.
from tadaconv import TAdaConv2d, RouteFuncMLP

class Model(nn.Module):
    def __init__(self):
        ...
        # 2. Define tadaconv and the route func in your model.
        self.conv_rf = RouteFuncMLP(
            c_in=64,         # number of input filters
            ratio=4,         # reduction ratio for MLP
            kernels=[3, 3],  # list of temporal kernel sizes
        )
        self.conv = TAdaConv2d(
            in_channels=64,
            out_channels=64,
            kernel_size=[1, 3, 3],  # usually the temporal kernel size is fixed to be 1
            stride=[1, 1, 1],       # usually the temporal stride is fixed to be 1
            padding=[0, 1, 1],      # usually the temporal padding is fixed to be 0
            bias=False,
            cal_dim="cin",
        )
        ...

    def forward(self, x):
        ...
        # 3. Replace `x = self.conv(x)` with the following line.
        x = self.conv(x, self.conv_rf(x))
        ...
```
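A quick smoke test of the wiring above. This is a hypothetical usage example: the tensor shapes assume the 5D (N, C, T, H, W) input convention used by the TAdaConv repo, and it must run in an environment where the repo's `tadaconv` module is importable:

```python
import torch
from tadaconv import TAdaConv2d, RouteFuncMLP

conv_rf = RouteFuncMLP(c_in=64, ratio=4, kernels=[3, 3])
conv = TAdaConv2d(in_channels=64, out_channels=64,
                  kernel_size=[1, 3, 3], stride=[1, 1, 1],
                  padding=[0, 1, 1], bias=False, cal_dim="cin")

x = torch.randn(2, 64, 8, 56, 56)   # 2 clips, 64 channels, 8 frames, 56x56 resolution
y = conv(x, conv_rf(x))             # calibration weights generated per frame
print(y.shape)                      # expected: torch.Size([2, 64, 8, 56, 56])
```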
04. The TAda2D & TAdaConvNeXt networks
Based on ConvNeXt [6] and ResNet, we build the TAdaConvNeXt and TAda2D video models respectively. For ConvNeXt, we directly replace the 2D depthwise convolution in ConvNeXt with a depthwise TAdaConv and, following most existing transformer-based video models, use a 3D tokenization method. For ResNet, since ResNet-based video models usually use a 2D stem, we additionally propose a temporal information aggregation scheme based on average pooling, applied after TAdaConv:
\( \mathbf{x}_\text{aggr}=\delta(\text{BN}_1(\mathbf{\tilde{x}})+\text{BN}_2(\text{TempAvgPool}_k(\mathbf{\tilde{x}}))) \)
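For illustration, here is a minimal PyTorch sketch of this aggregation; the module and parameter names are our own, with \( k \) the temporal pooling kernel size:

```python
import torch
import torch.nn as nn

class TemporalAggregation(nn.Module):
    """Sketch of x_aggr = ReLU(BN1(x) + BN2(TempAvgPool_k(x))); names are ours."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.bn1 = nn.BatchNorm3d(channels)
        self.bn2 = nn.BatchNorm3d(channels)
        # average pooling along the temporal axis only
        self.pool = nn.AvgPool3d(kernel_size=(k, 1, 1), stride=1,
                                 padding=(k // 2, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (N, C, T, H, W), output of TAdaConv
        return self.relu(self.bn1(x) + self.bn2(self.pool(x)))
```

The two branches deliberately use separate BatchNorms; the ablation in Sec. 5.5 finds this worth about 8% over sharing a single one.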
Experiments
In the experiments, we evaluate TAdaConv, TAda2D, and TAdaConvNeXt on two tasks and a total of five datasets.
5.1 Hypothesis testing
We first verify that relaxing temporal invariance improves temporal modeling ability. We compare three ways of generating the calibration weights: learnable calibration, dynamic calibration, and our TAda. For learnable and dynamic calibration we further distinguish two variants according to whether they are temporally varying (T.V.), i.e., whether the convolution weights differ across frames (which is exactly the relaxation of temporal invariance). From the table below, the following conclusions can be drawn:
- Learnable calibration improves over the baseline without calibration weights to some extent (2.3%); if temporal invariance is relaxed, the improvement reaches 13.4%.
- Dynamic calibration brings a larger improvement over the baseline (19.2%) even under the strict temporal invariance constraint; further relaxing temporal invariance improves performance further (21.8%).
- TAdaConv's generation scheme, which combines local and global temporal context, performs best (59.2%).
This confirms that appropriately relaxing temporal invariance can indeed benefit temporal modeling.

5.2 Plug-in experiments
Inserting TAdaConv into mainstream convolutional video networks, including R(2+1)D [2], R3D [7], and SlowFast [8], yields considerable gains: about 1.3% on average on Kinetics-400, about 2.8% on average on Something-Something-V2, and above 1% on Epic-Kitchens. Meanwhile, the added computation is only 0.01–0.02 GFLOPs.


5.3 Action recognition results
We mainly evaluate action recognition on Kinetics-400, Something-Something-V2, and Epic-Kitchens-100. TAda2D achieves a better performance/computation trade-off than existing convolutional video models, while TAdaConvNeXt is highly competitive with recent transformer-based video models.



5.4 Temporal action localization results
For temporal action localization, we evaluate on HACS and Epic-Kitchens-100. Compared with the TSN baseline, the features provided by TAda2D bring around a 5% mAP improvement on both datasets.

5.5 Ablation studies
We ablate the calibration weight generation process of TAda (Table 4). We find that taking temporal context into account during calibration weight generation is the key to TAdaConv's performance improvement. On top of the baseline, using two 1D convs (Non-Lin.) to generate calibration weights from the local context brings a 25.8% performance gain, while considering only the global context brings a 17.6% gain (54.4 vs. 36.8). Combining the two brings a 27.2% gain over the baseline.
Further adding the temporal information aggregation, we find that using separate BatchNorms for the identity branch and the average-pooled branch brings an 8% gain compared with sharing the same BatchNorm.

We also try replacing 2D convs with TAdaConv at different stages and for different fractions of channels (left figure). We find that using TAdaConv on only 1/64 of the channels is already enough to greatly improve the network's temporal modeling ability, and that replacing 2D convs at deeper stages of the network brings larger gains. Compared with existing video models (right figure), TAda2D and TAdaConvNeXt achieve the best performance/computation trade-off.

For the calibration dimension, we find that calibrating the input channel dimension brings the largest improvement, but as long as some dimension is calibrated, the performance is similar.

References
[1] Learning spatio-temporal representation with pseudo-3D residual networks, in ICCV 2017.
[2] A closer look at spatiotemporal convolutions for action recognition, in CVPR 2018.
[3] Natural image statistics and neural representation, in Annual Review of Neuroscience 2001.
[4] Revisiting spatial invariance with low-rank local connectivity, in ICML 2020.
[5] Decoupled dynamic filter networks, in CVPR 2021.
[6] A ConvNet for the 2020s, in arXiv 2022.
[7] Learning spatiotemporal features with 3D convolutional networks, in ICCV 2015.
[8] SlowFast networks for video recognition, in ICCV 2019.