CVPR 2022 Oral | NVIDIA proposes A-ViT, an efficient vision transformer with adaptive tokens: computation on uninformative tokens can be halted early
2022-06-24 10:08:00 [Zhiyuan Community]

Paper link: https://arxiv.org/pdf/2112.07658.pdf
Overview
This paper proposes A-ViT, a method that adaptively adjusts the inference cost of a vision transformer (ViT) for images of different complexity. A-ViT achieves this by automatically reducing the number of tokens processed in the network during inference. The authors reformulate Adaptive Computation Time (ACT) for this task, halting and discarding redundant spatial tokens. The structural properties of vision transformers allow this adaptive token-reduction mechanism to speed up inference without modifying the network architecture or the inference hardware. The authors show that A-ViT requires no extra parameters or sub-networks, because the halting behavior is learned adaptively from the original network parameters. They further introduce a distribution prior regularization that stabilizes training compared with prior ACT approaches. On image classification (ImageNet-1K), A-ViT efficiently filters out less informative spatial features and reduces the overall compute: it improves the throughput of DeiT-Tiny by 62% and of DeiT-Small by 38% with only a 0.3% drop in accuracy, clearly outperforming prior art.
Contributions
Transformers are a popular neural-network architecture that computes the network output with a highly expressive attention mechanism. They originated in the natural language processing (NLP) community and have proven effective on a wide range of NLP problems such as machine translation, representation learning, and question answering.
Recently, vision transformers have become increasingly popular in the vision community and have been successfully applied to a wide range of visual tasks such as image classification, object detection, image generation, and semantic segmentation. In the most common formulation, a vision transformer forms tokens by splitting the image into a sequence of ordered patches and solves the underlying task by modeling interactions among the tokens.
Processing images with vision transformers remains computationally expensive, however, mainly because the number of interactions between tokens grows quadratically with the token count. Deploying vision transformers on data-processing clusters or edge devices is therefore challenging, as it demands substantial compute and memory resources.
This paper focuses on automatically adjusting the computation of a vision transformer according to the complexity of the input image. Almost all mainstream vision transformers have a fixed inference cost that is independent of the input. Yet the difficulty of a prediction task varies with the complexity of the input image: classifying cars and people in an image with a homogeneous background is relatively easy, while distinguishing dog breeds against a cluttered background is far more challenging. Even within a single image, patches containing detailed object features carry more information than background patches. Motivated by this, the authors develop an input-dependent adaptive computation framework for vision transformers.
Input-dependent inference for neural networks has been studied in previous work. Prior work proposed Adaptive Computation Time (ACT), which expresses the output of a neural module as a mean-field model defined by a halting distribution. This formulation relaxes the discrete halting problem into a continuous optimization that minimizes an upper bound on the total computation. The authors show, however, that the uniform shape and tokenization of vision transformers allow adaptive computation to yield direct speedups on off-the-shelf hardware, exceeding prior work in the efficiency-accuracy tradeoff.
In this paper, the authors propose an input-dependent adaptive inference mechanism for vision transformers. A straightforward approach would follow ACT and halt the computation of all tokens simultaneously for the remaining layers. The authors observe that this reduces computation only slightly while causing an unnecessary loss of accuracy.
To address this, the authors propose A-ViT, a spatially adaptive inference mechanism that halts different tokens at different depths, dynamically reserving computation only for the discriminative tokens. Unlike ACT applied pointwise to convolutional feature maps, this spatial halting is directly supported by high-performance hardware, because halted tokens can simply be removed from the underlying computation. Moreover, the entire halting mechanism can be learned with the model's existing parameters, without introducing any additional ones. The authors also propose a new way to impose different computational budgets by applying a distribution prior to the halting probabilities.
The experiments show that the learned computation depth is highly correlated with object semantics, indicating that the model can ignore less relevant background information (see the examples in the figure above). The proposed method significantly reduces inference cost: A-ViT improves the throughput of DeiT-Tiny by 62% and of DeiT-Small by 38%, while ImageNet-1K accuracy drops by only 0.3%. The main contributions are as follows:
- A new method for input-dependent inference in vision transformers that allows halting the computation of different tokens at different depths.
- The learning of adaptive token halting is built on the embedding dimensions that already exist in the original architecture, so no extra parameters or computation are needed to halt tokens.
- A distribution prior regularization that guides halting toward a specific distribution, stabilizing training relative to ACT and controlling the average token depth.
- An analysis of how token depths vary across different images, providing insight into the attention mechanism of vision transformers.
- Experiments showing that the proposed method improves hardware throughput by 62% with only a marginal accuracy drop.
Method
Consider a vision transformer network that takes an image $x \in \mathbb{R}^{C \times H \times W}$ ($C$, $H$, and $W$ are the channel, height, and width) as input and makes a prediction as:

$$y = (F_L \circ F_{L-1} \circ \cdots \circ F_1)(E(x)),$$

where the encoding network $E(\cdot)$ converts image patches into a sequence of tokens $t \in \mathbb{R}^{K \times E}$, with $K$ the total number of tokens and $E$ the embedding dimension of each token, and the $L$ intermediate transformer blocks $F_l(\cdot)$ process the input tokens via self-attention. Consider the $l$-th transformer block; it processes all tokens coming out of layer $l-1$ as:

$$t_1^l, \dots, t_K^l = F_l\big(t_1^{l-1}, \dots, t_K^{l-1}\big).$$
The left-hand side denotes all updated tokens. The internal computation flow of a transformer block $F_l(\cdot)$ allows the number of tokens $K$ to change from one layer to the next; when tokens are discarded by the halting mechanism, this is what yields the computational gain. Vision transformers use a consistent feature size for all tokens across all layers.
This makes it easy to learn a global halting mechanism that monitors all layers jointly. Compared with CNNs, where varying model dimensions (e.g., different channel counts at different depths) must be handled explicitly, it also makes the halting mechanism easier to design for transformers.
To halt tokens adaptively, the authors introduce for each token an input-dependent halting score, interpreted as the halting probability of token $k$ at layer $l$:

$$h_k^l = H\big(t_k^l\big),$$

where $H(\cdot)$ is a halting module. Similar to ACT, each halting score $h_k^l$ is forced to lie in $[0, 1]$, and importance is accumulated as inference proceeds deeper in order to halt tokens. Specifically, a token is halted at the first layer $N_k$ at which its cumulative halting score exceeds $1 - \epsilon$:

$$N_k = \min\Big\{ n \le L \;:\; \sum_{l=1}^{n} h_k^l \ge 1 - \epsilon \Big\}.$$
Here, $\epsilon$ is a small positive constant that allows halting after just one layer. To further remove any dependence on dynamically halted tokens in the layers that follow, once a token halts the authors mask it for all remaining depths $l > N_k$ by (1) zeroing the token and (2) preventing other tokens from attending to it, shielding its influence on the output. All tokens are forced to halt at the last layer. During training, this token masking keeps the cost of a training iteration similar to that of the original vision transformer. At inference, however, halted tokens are simply removed from the computation, realizing the actual speedup delivered by the halting mechanism.
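To make the masking step concrete, below is a minimal PyTorch-style sketch (the tensor names and helper are illustrative assumptions, not the authors' code) of how halted tokens can be zeroed and excluded from attention:

```python
import torch

def mask_halted_tokens(tokens, attn_logits, halted):
    """Mask tokens that have already halted.

    tokens:      (B, K, E) token embeddings
    attn_logits: (B, heads, K, K) pre-softmax attention scores
    halted:      (B, K) bool, True where a token has stopped
    """
    # (1) zero out the embeddings of halted tokens
    tokens = tokens.masked_fill(halted.unsqueeze(-1), 0.0)
    # (2) prevent every token from attending to halted tokens
    #     (mask the key axis so halted tokens contribute nothing)
    attn_logits = attn_logits.masked_fill(halted[:, None, None, :], float("-inf"))
    return tokens, attn_logits
```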
The authors integrate $H(\cdot)$ into the existing vision-transformer blocks by dedicating a single neuron in an MLP layer to the task, so halting can be computed without introducing any additional learnable parameters or sub-networks. More specifically, the embedding dimension $E$ of each token has enough capacity to also accommodate halting, so the halting score is computed as:

$$h_k^l = \sigma\big(\gamma \cdot t_{k,1}^l + \beta\big),$$

where $t_{k,e}^l$ denotes the $e$-th dimension of token $t_k^l$, $\sigma$ is the logistic sigmoid function, and $\beta$ and $\gamma$ are shift and scale parameters that adjust the embedding before the nonlinearity is applied. These two scalars are shared across all tokens and all layers. Only a single dimension of the embedding $E$ is used for the halting-score computation; experimentally, the authors observe that the simple choice of the first dimension performs well. Apart from the two scalars $\beta$ and $\gamma$, the halting mechanism introduces no additional parameters or sub-networks.
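A minimal sketch of this halting head, again assuming PyTorch (using the first embedding dimension follows the paper's simple choice; `beta` and `gamma` are the two shared scalars described above):

```python
import torch

def halting_score(tokens, beta, gamma):
    """Per-token halting score from the first embedding dimension.

    tokens: (B, K, E) token embeddings at layer l
    beta, gamma: scalar shift and scale, shared across all tokens and layers
    returns: (B, K) halting scores in [0, 1]
    """
    return torch.sigmoid(gamma * tokens[..., 0] + beta)
```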
To track the progression of the halting probability across layers, the authors compute a remainder for each token:

$$r_k = 1 - \sum_{l=1}^{N_k - 1} h_k^l.$$

The halting probability is then formed as:

$$p_k^l = \begin{cases} h_k^l, & l < N_k, \\ r_k, & l = N_k. \end{cases}$$

Given the ranges of $h$ and $r$, the halting probability of every token at every layer lies in $[0, 1]$. The loss that encourages early halting (the ponder loss) is:

$$\mathcal{L}_{\text{ponder}} = \frac{1}{K} \sum_{k=1}^{K} \big(N_k + r_k\big).$$
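The sketch below (PyTorch-style, with illustrative names) accumulates halting scores over depth, records each token's halting layer $N_k$ and remainder $r_k$, and forms the ponder loss:

```python
import torch

def accumulate_halting(h_per_layer, eps=0.01):
    """Track cumulative halting and derive the ponder loss.

    h_per_layer: list of L tensors, each (B, K), halting scores per layer
                 (the last layer's scores are assumed set to 1 so all tokens halt)
    returns: halting depth N_k, remainder r_k, and the mean ponder cost
    """
    B, K = h_per_layer[0].shape
    cum = torch.zeros(B, K)
    depth = torch.zeros(B, K)
    remainder = torch.zeros(B, K)
    active = torch.ones(B, K, dtype=torch.bool)
    for l, h in enumerate(h_per_layer, start=1):
        halts_now = active & (cum + h >= 1.0 - eps)
        remainder = torch.where(halts_now, 1.0 - cum, remainder)  # r_k
        depth = torch.where(halts_now, torch.full_like(depth, float(l)), depth)  # N_k
        cum = cum + h
        active = active & ~halts_now
    ponder_loss = (depth + remainder).mean()  # rho_k = N_k + r_k, averaged over tokens
    return depth, remainder, ponder_loss
```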
Vision transformers use a dedicated class token, denoted $t_{\text{cls}}$, to generate the classification prediction; like the other input tokens, it is updated at every layer. The authors apply the mean-field formulation (a weighted average of the previous states, weighted by the halting probabilities) to form the output token and the associated task loss:

$$t_{\text{out}} = \sum_{l=1}^{L} p_{\text{cls}}^l \cdot t_{\text{cls}}^l.$$

The vision transformer can then be trained by minimizing:

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \alpha_p \mathcal{L}_{\text{ponder}},$$

where $\alpha_p$ weights the second (ponder) term against the main task loss.
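A sketch of the mean-field readout for the class token (a hypothetical helper; `p_cls` holds the per-layer halting probabilities of the class token):

```python
import torch

def mean_field_class_token(cls_per_layer, p_cls):
    """Weighted average of the class token across layers.

    cls_per_layer: list of L tensors, each (B, E): class token after layer l
    p_cls:         list of L tensors, each (B,):  halting probability p_cls^l
    """
    out = torch.zeros_like(cls_per_layer[0])
    for t, p in zip(cls_per_layer, p_cls):
        out = out + p.unsqueeze(-1) * t
    return out  # fed to the classification head for the task loss
```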
Algorithm 1, shown above, presents the entire computation flow.

The figure above illustrates the halting mechanism.
An important term in the loss function is $\alpha_p$: the larger its value, the heavier the penalty, and the earlier tokens are halted. While effective at reducing computation, previous work on adaptive computation found that training is very sensitive to the choice of $\alpha_p$, and that its value does not provide fine-grained control over the accuracy-efficiency tradeoff. The authors therefore introduce a distribution prior regularization so that tokens exit around a target depth. In this setting, over a large number of input images, token depths are expected to vary within the distribution prior. To this end, the authors define the halting-score distribution:

$$H_l = \frac{1}{K} \sum_{k=1}^{K} h_k^l, \qquad l = 1, \dots, L,$$

which averages the expected halting score of all tokens at each layer of the network. It serves as an estimate of how the halting probability is distributed across layers, and this distribution can be pulled toward a predefined prior using the KL divergence. The new distribution-prior regularization term is therefore:

$$\mathcal{L}_{\text{distr.}} = \mathrm{KL}\big(H \,\|\, H_{\text{target}}\big),$$

where $\mathrm{KL}$ denotes the KL divergence and $H_{\text{target}}$ is a target halting-score distribution that guides the layer at which tokens halt. The authors use the probability density function of a Gaussian to define a bell-shaped target $H_{\text{target}}$ centered at the expected halting depth. This encourages the sum of each token's expected halting scores to trigger the exit condition around the target depth, providing enhanced control over the expected remaining computation.
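A sketch of the distribution-prior term, assuming a Gaussian target over layers (the variance, normalization, and KL direction details here are illustrative assumptions, not settings taken from the paper):

```python
import torch
import torch.nn.functional as F

def distribution_prior_loss(h_per_layer, target_depth, sigma=1.0):
    """KL divergence between mean halting scores and a Gaussian prior over layers.

    h_per_layer:  list of L tensors (B, K) of halting scores
    target_depth: layer index around which tokens should halt on average
    """
    L = len(h_per_layer)
    # expected halting score per layer, averaged over tokens and batch
    H = torch.stack([h.mean() for h in h_per_layer])
    H = H / H.sum()  # normalize into a distribution over layers
    layers = torch.arange(L, dtype=torch.float32)
    target = torch.exp(-0.5 * ((layers - target_depth) / sigma) ** 2)
    target = target / target.sum()
    # KL(H || target): F.kl_div(input=log q, target=p) computes KL(p || q)
    return F.kl_div(target.log(), H, reduction="sum")
```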
The final loss for training the network parameters with adaptive token computation is:

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \alpha_d \mathcal{L}_{\text{distr.}} + \alpha_p \mathcal{L}_{\text{ponder}},$$

where $\alpha_d$ is a scalar coefficient that balances the distribution regularization against the other loss terms.
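Putting the three terms together (the coefficient values below are placeholders, not the paper's settings):

```python
def total_loss(task_loss, distr_loss, ponder_loss, alpha_d=0.1, alpha_p=1e-3):
    """Overall objective: L = L_task + alpha_d * L_distr + alpha_p * L_ponder."""
    return task_loss + alpha_d * distr_loss + alpha_p * ponder_loss
```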
Experiments

The figure above shows the adaptive token depths learned by A-ViT-T during inference on the ImageNet-1K validation set.

Panel (a) depicts the average learned token depth over the validation set, which resembles a two-dimensional Gaussian centered on the image. This aligns with the image distribution and with the fact that most ImageNet subjects are centered. Panel (b) shows a histogram of the average halting score per layer, over all tokens of each image: the halting score rises gradually in the early layers, peaks, and then decreases.
The per-image adaptive token depth can also be used to analyze how difficult an image is for the network. In the figure above, the authors therefore show hard and easy samples, ranked by the amount of computation they require.
Given the adaptive inference paradigm, the authors further analyze how the classification accuracy of individual categories changes relative to the full model. They compute the change in per-class validation accuracy before and after applying adaptive inference, and summarize the qualitative and quantitative results in the table above.
The table above compares the proposed method with existing dynamic-halting approaches for transformer inference.
To further visualize the improvement over the state-of-the-art DynamicViT, the authors present a qualitative comparison of token depths.
In the table above, the authors compare GPU speedup ratios with existing methods.