CVPR 2022 Oral | NVIDIA proposes A-ViT, an efficient vision transformer with adaptive tokens: computation on uninformative tokens can be halted early
2022-06-24 10:08:00 [Zhiyuan Community]

Paper link: https://arxiv.org/pdf/2112.07658.pdf
Overview
This paper proposes A-ViT, a method that adaptively adjusts the inference cost of a vision transformer (ViT) for images of different complexity. A-ViT achieves this by automatically reducing the number of tokens processed in the network during inference. The authors reformulate Adaptive Computation Time (ACT) for this task, halting and discarding redundant spatial tokens. The structural properties of vision transformers allow this adaptive token-reduction mechanism to speed up inference without modifying the network architecture or the inference hardware. The authors show that A-ViT requires no extra parameters or sub-networks, because the halting behavior is learned adaptively from the original network parameters. They further introduce a distribution prior regularization that stabilizes training compared with prior ACT approaches. On image classification (ImageNet-1K), A-ViT efficiently filters out less informative spatial features and reduces the overall compute: it improves the throughput of DeiT-Tiny by 62% and of DeiT-Small by 38% with only a 0.3% drop in accuracy, clearly outperforming prior art.
Contributions
Transformers are a popular neural-network architecture that computes the network output with a highly expressive attention mechanism. They originated in the natural language processing (NLP) community and have proven effective on a wide range of NLP problems such as machine translation, representation learning, and question answering.
Recently, vision transformers have become increasingly popular in the vision community and have been successfully applied to a wide range of visual tasks such as image classification, object detection, image generation, and semantic segmentation. In the most common formulation, a vision transformer forms tokens by splitting the image into a sequence of ordered patches and solves the underlying task by modeling interactions among the tokens.
Processing images with vision transformers remains computationally expensive, however, mainly because the number of interactions between tokens grows quadratically with the token count. Deploying vision transformers on data-processing clusters or edge devices is therefore challenging, as it demands substantial compute and memory resources.
This paper focuses on automatically adjusting the computation of a vision transformer according to the complexity of the input image. Almost all mainstream vision transformers have a fixed inference cost that is independent of the input. Yet the difficulty of a prediction task varies with the complexity of the input image: classifying cars and people in an image with a homogeneous background is relatively easy, while distinguishing dog breeds against a cluttered background is far more challenging. Even within a single image, patches containing detailed object features carry more information than background patches. Motivated by this, the authors develop an input-dependent adaptive computation framework for vision transformers.
Input-dependent inference for neural networks has been studied in previous work. Prior work proposed Adaptive Computation Time (ACT), which expresses the output of a neural module as a mean-field model defined by a halting distribution. This formulation relaxes the discrete halting problem into a continuous optimization that minimizes an upper bound on the total computation. The authors show, however, that the uniform shape and tokenization of vision transformers allow adaptive computation to yield direct speedups on off-the-shelf hardware, exceeding prior work in the efficiency-accuracy tradeoff.
In this paper, the authors propose an input-dependent adaptive inference mechanism for vision transformers. A straightforward approach would follow ACT and halt the computation of all tokens simultaneously for the remaining layers. The authors observe that this reduces computation only slightly while causing an unnecessary loss of accuracy.
To address this, the authors propose A-ViT, a spatially adaptive inference mechanism that halts different tokens at different depths, dynamically reserving computation only for the discriminative tokens. Unlike ACT applied pointwise to convolutional feature maps, this spatial halting is directly supported by high-performance hardware, because halted tokens can simply be removed from the underlying computation. Moreover, the entire halting mechanism can be learned with the model's existing parameters, without introducing any additional ones. The authors also propose a new way to impose different computational budgets by applying a distribution prior to the halting probabilities.
The experiments show that the learned computation depth is highly correlated with object semantics, indicating that the model can ignore less relevant background information (see the examples in the figure above). The proposed method significantly reduces inference cost: A-ViT improves the throughput of DeiT-Tiny by 62% and of DeiT-Small by 38%, while ImageNet-1K accuracy drops by only 0.3%. The main contributions are as follows:
- A new method for input-dependent inference in vision transformers that allows halting the computation of different tokens at different depths.
- The learning of adaptive token halting is built on the embedding dimensions that already exist in the original architecture, so no extra parameters or computation are needed to halt tokens.
- A distribution prior regularization that guides halting toward a specific distribution, stabilizing training relative to ACT and controlling the average token depth.
- An analysis of how token depths vary across different images, providing insight into the attention mechanism of vision transformers.
- Experiments showing that the proposed method improves hardware throughput by 62% with only a marginal accuracy drop.
Method
Consider a vision transformer network that takes an image $x \in \mathbb{R}^{C \times H \times W}$ ($C$, $H$, and $W$ are the channel, height, and width) as input and makes a prediction as:

$$y = (F_L \circ F_{L-1} \circ \cdots \circ F_1)(E(x)),$$

where the encoding network $E(\cdot)$ converts image patches into a sequence of tokens $t \in \mathbb{R}^{K \times E}$, with $K$ the total number of tokens and $E$ the embedding dimension of each token, and the $L$ intermediate transformer blocks $F_l(\cdot)$ process the input tokens via self-attention. Consider the $l$-th transformer block; it processes all tokens coming out of layer $l-1$ as:

$$t_1^l, \dots, t_K^l = F_l\big(t_1^{l-1}, \dots, t_K^{l-1}\big).$$
The left-hand side denotes all updated tokens. The internal computation flow of a transformer block $F_l(\cdot)$ allows the number of tokens $K$ to change from one layer to the next; when tokens are discarded by the halting mechanism, this is what yields the computational gain. Vision transformers use a consistent feature size for all tokens across all layers.
This makes it easy to learn a global halting mechanism that monitors all layers jointly. Compared with CNNs, where varying model dimensions (e.g., different channel counts at different depths) must be handled explicitly, it also makes the halting mechanism easier to design for transformers.
To halt tokens adaptively, the authors introduce for each token an input-dependent halting score, interpreted as the halting probability of token $k$ at layer $l$:

$$h_k^l = H\big(t_k^l\big),$$

where $H(\cdot)$ is a halting module. Similar to ACT, each halting score $h_k^l$ is forced to lie in $[0, 1]$, and importance is accumulated as inference proceeds deeper in order to halt tokens. Specifically, a token is halted at the first layer $N_k$ at which its cumulative halting score exceeds $1 - \epsilon$:

$$N_k = \min\Big\{ n \le L \;:\; \sum_{l=1}^{n} h_k^l \ge 1 - \epsilon \Big\}.$$
Here, $\epsilon$ is a small positive constant that allows halting after just one layer. To further remove any dependence on dynamically halted tokens in the layers that follow, once a token halts the authors mask it for all remaining depths $l > N_k$ by (1) zeroing the token and (2) preventing other tokens from attending to it, shielding its influence on the output. All tokens are forced to halt at the last layer. During training, this token masking keeps the cost of a training iteration similar to that of the original vision transformer. At inference, however, halted tokens are simply removed from the computation, realizing the actual speedup delivered by the halting mechanism.
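To make the masking step concrete, below is a minimal PyTorch-style sketch (the tensor names and helper are illustrative assumptions, not the authors' code) of how halted tokens can be zeroed and excluded from attention:

```python
import torch

def mask_halted_tokens(tokens, attn_logits, halted):
    """Mask tokens that have already halted.

    tokens:      (B, K, E) token embeddings
    attn_logits: (B, heads, K, K) pre-softmax attention scores
    halted:      (B, K) bool, True where a token has stopped
    """
    # (1) zero out the embeddings of halted tokens
    tokens = tokens.masked_fill(halted.unsqueeze(-1), 0.0)
    # (2) prevent every token from attending to halted tokens
    #     (mask the key axis so halted tokens contribute nothing)
    attn_logits = attn_logits.masked_fill(halted[:, None, None, :], float("-inf"))
    return tokens, attn_logits
```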
The authors integrate $H(\cdot)$ into the existing vision-transformer blocks by dedicating a single neuron in an MLP layer to the task, so halting can be computed without introducing any additional learnable parameters or sub-networks. More specifically, the embedding dimension $E$ of each token has enough capacity to also accommodate halting, so the halting score is computed as:

$$h_k^l = \sigma\big(\gamma \cdot t_{k,1}^l + \beta\big),$$

where $t_{k,e}^l$ denotes the $e$-th dimension of token $t_k^l$, $\sigma$ is the logistic sigmoid function, and $\beta$ and $\gamma$ are shift and scale parameters that adjust the embedding before the nonlinearity is applied. These two scalars are shared across all tokens and all layers. Only a single dimension of the embedding $E$ is used for the halting-score computation; experimentally, the authors observe that the simple choice of the first dimension performs well. Apart from the two scalars $\beta$ and $\gamma$, the halting mechanism introduces no additional parameters or sub-networks.
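A minimal sketch of this halting head, again assuming PyTorch (using the first embedding dimension follows the paper's simple choice; `beta` and `gamma` are the two shared scalars described above):

```python
import torch

def halting_score(tokens, beta, gamma):
    """Per-token halting score from the first embedding dimension.

    tokens: (B, K, E) token embeddings at layer l
    beta, gamma: scalar shift and scale, shared across all tokens and layers
    returns: (B, K) halting scores in [0, 1]
    """
    return torch.sigmoid(gamma * tokens[..., 0] + beta)
```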
To track the progression of the halting probability across layers, the authors compute a remainder for each token:

$$r_k = 1 - \sum_{l=1}^{N_k - 1} h_k^l.$$

The halting probability is then formed as:

$$p_k^l = \begin{cases} h_k^l, & l < N_k, \\ r_k, & l = N_k. \end{cases}$$

Given the ranges of $h$ and $r$, the halting probability of every token at every layer lies in $[0, 1]$. The loss that encourages early halting (the ponder loss) is:

$$\mathcal{L}_{\text{ponder}} = \frac{1}{K} \sum_{k=1}^{K} \big(N_k + r_k\big).$$
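The sketch below (PyTorch-style, with illustrative names) accumulates halting scores over depth, records each token's halting layer $N_k$ and remainder $r_k$, and forms the ponder loss:

```python
import torch

def accumulate_halting(h_per_layer, eps=0.01):
    """Track cumulative halting and derive the ponder loss.

    h_per_layer: list of L tensors, each (B, K), halting scores per layer
                 (the last layer's scores are assumed set to 1 so all tokens halt)
    returns: halting depth N_k, remainder r_k, and the mean ponder cost
    """
    B, K = h_per_layer[0].shape
    cum = torch.zeros(B, K)
    depth = torch.zeros(B, K)
    remainder = torch.zeros(B, K)
    active = torch.ones(B, K, dtype=torch.bool)
    for l, h in enumerate(h_per_layer, start=1):
        halts_now = active & (cum + h >= 1.0 - eps)
        remainder = torch.where(halts_now, 1.0 - cum, remainder)  # r_k
        depth = torch.where(halts_now, torch.full_like(depth, float(l)), depth)  # N_k
        cum = cum + h
        active = active & ~halts_now
    ponder_loss = (depth + remainder).mean()  # rho_k = N_k + r_k, averaged over tokens
    return depth, remainder, ponder_loss
```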
Vision transformers use a dedicated class token, denoted $t_{\text{cls}}$, to generate the classification prediction; like the other input tokens, it is updated at every layer. The authors apply the mean-field formulation (a weighted average of the previous states, weighted by the halting probabilities) to form the output token and the associated task loss:

$$t_{\text{out}} = \sum_{l=1}^{L} p_{\text{cls}}^l \cdot t_{\text{cls}}^l.$$

The vision transformer can then be trained by minimizing:

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \alpha_p \mathcal{L}_{\text{ponder}},$$

where $\alpha_p$ weights the second (ponder) term against the main task loss.
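A sketch of the mean-field readout for the class token (a hypothetical helper; `p_cls` holds the per-layer halting probabilities of the class token):

```python
import torch

def mean_field_class_token(cls_per_layer, p_cls):
    """Weighted average of the class token across layers.

    cls_per_layer: list of L tensors, each (B, E): class token after layer l
    p_cls:         list of L tensors, each (B,):  halting probability p_cls^l
    """
    out = torch.zeros_like(cls_per_layer[0])
    for t, p in zip(cls_per_layer, p_cls):
        out = out + p.unsqueeze(-1) * t
    return out  # fed to the classification head for the task loss
```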
Algorithm 1, shown above, presents the entire computation flow.

The figure above illustrates the halting mechanism.
An important term in the loss function is $\alpha_p$: the larger its value, the heavier the penalty, and the earlier tokens are halted. While effective at reducing computation, previous work on adaptive computation found that training is very sensitive to the choice of $\alpha_p$, and that its value does not provide fine-grained control over the accuracy-efficiency tradeoff. The authors therefore introduce a distribution prior regularization so that tokens exit around a target depth. In this setting, over a large number of input images, token depths are expected to vary within the distribution prior. To this end, the authors define the halting-score distribution:

$$H_l = \frac{1}{K} \sum_{k=1}^{K} h_k^l, \qquad l = 1, \dots, L,$$

which averages the expected halting score of all tokens at each layer of the network. It serves as an estimate of how the halting probability is distributed across layers, and this distribution can be pulled toward a predefined prior using the KL divergence. The new distribution-prior regularization term is therefore:

$$\mathcal{L}_{\text{distr.}} = \mathrm{KL}\big(H \,\|\, H_{\text{target}}\big),$$

where $\mathrm{KL}$ denotes the KL divergence and $H_{\text{target}}$ is a target halting-score distribution that guides the layer at which tokens halt. The authors use the probability density function of a Gaussian to define a bell-shaped target $H_{\text{target}}$ centered at the expected halting depth. This encourages the sum of each token's expected halting scores to trigger the exit condition around the target depth, providing enhanced control over the expected remaining computation.
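A sketch of the distribution-prior term, assuming a Gaussian target over layers (the variance, normalization, and KL direction details here are illustrative assumptions, not settings taken from the paper):

```python
import torch
import torch.nn.functional as F

def distribution_prior_loss(h_per_layer, target_depth, sigma=1.0):
    """KL divergence between mean halting scores and a Gaussian prior over layers.

    h_per_layer:  list of L tensors (B, K) of halting scores
    target_depth: layer index around which tokens should halt on average
    """
    L = len(h_per_layer)
    # expected halting score per layer, averaged over tokens and batch
    H = torch.stack([h.mean() for h in h_per_layer])
    H = H / H.sum()  # normalize into a distribution over layers
    layers = torch.arange(L, dtype=torch.float32)
    target = torch.exp(-0.5 * ((layers - target_depth) / sigma) ** 2)
    target = target / target.sum()
    # KL(H || target): F.kl_div(input=log q, target=p) computes KL(p || q)
    return F.kl_div(target.log(), H, reduction="sum")
```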
The final loss for training the network parameters with adaptive token computation is:

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \alpha_d \mathcal{L}_{\text{distr.}} + \alpha_p \mathcal{L}_{\text{ponder}},$$

where $\alpha_d$ is a scalar coefficient that balances the distribution regularization against the other loss terms.
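Putting the three terms together (the coefficient values below are placeholders, not the paper's settings):

```python
def total_loss(task_loss, distr_loss, ponder_loss, alpha_d=0.1, alpha_p=1e-3):
    """Overall objective: L = L_task + alpha_d * L_distr + alpha_p * L_ponder."""
    return task_loss + alpha_d * distr_loss + alpha_p * ponder_loss
```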
Experiments

The figure above shows the adaptive token depths learned by A-ViT-T during inference on the ImageNet-1K validation set.

Panel (a) depicts the average learned token depth over the validation set, which resembles a two-dimensional Gaussian centered on the image. This aligns with the image distribution and with the fact that most ImageNet subjects are centered. Panel (b) shows a histogram of the average halting score per layer, over all tokens of each image: the halting score rises gradually in the early layers, peaks, and then decreases.
The per-image adaptive token depth can also be used to analyze how difficult an image is for the network. In the figure above, the authors therefore show hard and easy samples, ranked by the amount of computation they require.
Given the adaptive inference paradigm, the authors further analyze how the classification accuracy of individual categories changes relative to the full model. They compute the change in per-class validation accuracy before and after applying adaptive inference, and summarize the qualitative and quantitative results in the table above.
The table above compares the proposed method with existing dynamic-halting approaches for transformer inference.
To further visualize the improvement over the state-of-the-art DynamicViT, the authors present a qualitative comparison of token depths.
In the table above, the authors compare GPU speedup ratios with existing methods.