[Paper Notes] Contextual Transformer Networks for Visual Recognition
2022-06-25 15:13:00 【m0_…】
The Paper
Title: Contextual Transformer Networks for Visual Recognition
Published in: CVPR 2021
Paper: https://arxiv.org/pdf/2107.12292.pdf
Code: GitHub - JDAI-CV/CoTNet: This is an official implementation for "Contextual Transformer Networks for Visual Recognition".
Preface
Self-attention-based Transformers triggered a revolution in natural language processing and have recently inspired Transformer-style architecture designs that achieve competitive results on many computer vision tasks.
Nevertheless, most existing designs apply self-attention directly on 2D feature maps, computing the attention matrix from isolated query-key pairs at each spatial location, without fully exploiting the rich context among neighboring keys. In the work shared today, the researchers design a novel Transformer-style module, the Contextual Transformer (CoT) block, for visual recognition. The design fully exploits the contextual information among input keys to guide the learning of a dynamic attention matrix, thereby strengthening the capacity of visual representation. Technically, the CoT block first encodes the input keys with a 3×3 convolution, producing a static contextual representation of the input.
In the figure above, (a) shows traditional self-attention, which measures the attention matrix with isolated query-key pairs only and leaves the rich context among keys underexploited; (b) shows the CoT block.
The researchers then concatenate the encoded keys with the input queries and learn a dynamic multi-head attention matrix through two consecutive 1×1 convolutions. Multiplying the learned attention matrix by the input values yields a dynamic contextual representation of the input, and the fusion of the static and dynamic contextual representations forms the final output. The CoT block is appealing because it can directly replace every 3×3 convolution in a ResNet architecture, producing a Transformer-style backbone named Contextual Transformer Networks (CoTNet). Extensive experiments across a wide range of applications (e.g., image recognition, object detection, and instance segmentation) verify the advantages of CoTNet as a stronger backbone.
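The pipeline just described (3×3 convolution on keys → static context; two 1×1 convolutions on [static context, query] → attention → dynamic context; fuse both) can be sketched in plain numpy. This is a minimal illustration, not the official PyTorch implementation: it uses depthwise 3×3 weights, random untrained parameters, a simple per-position channel attention in place of the paper's k×k local attention matrices, and additive fusion in place of the paper's attention-based fusion.

```python
import numpy as np

def conv3x3(X, W):
    """Depthwise-style 3x3 convolution with same padding over an (H, W, C) map."""
    H, Wd, C = X.shape
    Xp = np.pad(X, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(X)
    for i in range(H):
        for j in range(Wd):
            # sum over the 3x3 window per channel
            out[i, j] = np.einsum('abc,abc->c', Xp[i:i+3, j:j+3], W)
    return out

def cot_block(X, rng):
    """Simplified CoT block: static context from a 3x3 conv on the keys,
    dynamic context from attention learned off [static context, query]."""
    H, W, C = X.shape
    K1 = conv3x3(X, rng.standard_normal((3, 3, C)) * 0.1)  # static context K^1
    V = X @ rng.standard_normal((C, C)) * 0.1              # value embedding (1x1 conv)
    Y = np.concatenate([K1, X], axis=-1)                   # concatenate [K^1, Q]
    W1 = rng.standard_normal((2 * C, C)) * 0.1             # two consecutive 1x1 convs
    W2 = rng.standard_normal((C, C)) * 0.1
    A = np.maximum(Y @ W1, 0) @ W2                         # attention logits per position
    A = np.exp(A - A.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)                          # softmax -> attention weights
    K2 = A * V                                             # dynamic context K^2
    return K1 + K2                                         # fuse static + dynamic

rng = np.random.default_rng(0)
out = cot_block(rng.standard_normal((8, 8, 16)), rng)
print(out.shape)   # (8, 8, 16)
```

The key design point survives even in this sketch: the attention weights are computed from the context-encoded keys rather than from isolated query-key pairs.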
Background
Attention and self-attention mechanisms
Why an attention mechanism?
Before attention was born, CNNs, RNNs, and their variants already existed, so why introduce an attention mechanism? There are two main reasons:
(1) Limited computing power: when a model must remember a lot of "information", it becomes more complex, yet computing power remains a bottleneck for the development of neural networks.
(2) Limitations of the optimization algorithms: LSTMs alleviate the long-distance dependency problem of RNNs only to a certain extent, and their ability to "memorize" information is still limited.
What is the attention mechanism?
Before defining it, consider what happens when you look at a picture: when an overload of information comes into view, the brain focuses on the main information first. That is the brain's attention mechanism.
Likewise, when we read a sentence, the brain first remembers the important words. Modeling how the human brain handles information overload in this way is what led to the Attention mechanism, which can then be applied to natural language processing tasks.
Self-attention is one form of attention and an important component of the Transformer. As a variant of the attention mechanism, it reduces the dependence on external information and is better at capturing the internal correlations within the data or features. Applied to text, self-attention mainly computes the interactions between words to resolve long-distance dependencies.
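The "interactions between words" described above are exactly pairwise query-key similarities. A minimal numpy sketch of standard scaled dot-product self-attention (random weights stand in for learned projections):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a token sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project input into queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # pairwise query-key similarities
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)         # row-wise softmax -> attention weights
    return A @ V                               # each token: weighted sum of all values

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                # 5 tokens, 8-dim features
W = [rng.standard_normal((8, 8)) for _ in range(3)]
out = self_attention(X, *W)
print(out.shape)                               # (5, 8)
```

Because every token attends to every other token, distant words interact in a single step, which is how self-attention sidesteps the long-distance dependency problem of RNNs.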
Framework
1 Multi-head Self-attention in Vision Backbones
Here, the researchers give a general formulation of scalable local multi-head self-attention in vision backbones, as shown in figure (a) above. Formally, given a 2D input feature map X of size H × W × C (H: height, W: width, C: channels), X is transformed into queries Q = XW_q, keys K = XW_k, and values V = XW_v via embedding matrices W_q, W_k, W_v. Notably, each embedding matrix is implemented as a 1×1 convolution.
The local relation matrix R is then further enriched with the position information of each k × k grid, yielding the enhanced matrix R̂.
Next, the attention matrix A is obtained by normalizing the spatially-enhanced local relation matrix R̂ with a Softmax over each head along the channel dimension: A = Softmax(R̂). The feature vector at each spatial position of A is reshaped into C_h local attention matrices (each of size k × k), and the final output feature map is computed by aggregating, for each k × k grid, all values with the learned local attention matrices.
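The formulation above can be sketched in numpy for a single head. This is an illustrative simplification: the 1×1 embedding convolutions are replaced by identity maps, and position encoding is omitted, so only the local-window attention structure is shown.

```python
import numpy as np

def local_self_attention(X, k=3):
    """Local self-attention on a 2D feature map X of shape (H, W, C):
    each query attends only to the keys inside its k x k neighborhood."""
    H, W, C = X.shape
    pad = k // 2
    Q, K, V = X, X, X   # identity embeddings for brevity (paper uses 1x1 convs)
    Kp = np.pad(K, ((pad, pad), (pad, pad), (0, 0)))
    Vp = np.pad(V, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(X)
    for i in range(H):
        for j in range(W):
            keys = Kp[i:i+k, j:j+k].reshape(-1, C)   # the k*k local keys
            vals = Vp[i:i+k, j:j+k].reshape(-1, C)   # the k*k local values
            r = keys @ Q[i, j] / np.sqrt(C)          # local relation vector (R)
            a = np.exp(r - r.max())
            a /= a.sum()                             # softmax -> local attention (A)
            out[i, j] = a @ vals                     # aggregate values in the grid
    return out

X = np.random.default_rng(1).standard_normal((6, 6, 4))
Y = local_self_attention(X)
print(Y.shape)   # (6, 6, 4)
```

Note that the relation r at each position is computed from an isolated query-key dot product; this is precisely the independence that the CoT block is designed to remove.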
2 Contextual Transformer Block
Traditional self-attention can trigger feature interactions across different spatial locations, conditioned on the input itself. However, in the traditional mechanism all pairwise query-key relations are learned independently from isolated query-key pairs, without exploring the rich context between them. This severely limits the capacity of self-attention learning for visual representation on 2D feature maps.
To alleviate this problem, the researchers build a new Transformer-style building block, the Contextual Transformer (CoT) block shown in figure (b) above, which integrates contextual information mining and self-attention learning into a unified architecture.
3 Contextual Transformer Networks
ResNet-50 (left) and CoTNet-50 (right)
ResNeXt-50 with a 32×4d template (left) and CoTNeXt-50 with a 2×48d template (right).
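As the two architecture tables indicate, CoTNet-50 and CoTNeXt-50 keep the ResNet/ResNeXt bottleneck layout and only swap the spatial operator. A hedged numpy sketch of that design choice (random untrained weights; `conv3x3` is a depthwise stand-in, and any CoT-style operator with the same shape could be passed as `spatial_op`):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3x3(X, W):
    """Depthwise-style 3x3 convolution with same padding over an (H, W, C) map."""
    H, Wd, C = X.shape
    Xp = np.pad(X, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(X)
    for i in range(H):
        for j in range(Wd):
            out[i, j] = np.einsum('abc,abc->c', Xp[i:i+3, j:j+3], W)
    return out

def bottleneck(X, spatial_op):
    """ResNet-style bottleneck: 1x1 reduce -> spatial op -> 1x1 expand + residual.
    CoTNet keeps this structure and only replaces spatial_op
    (a 3x3 convolution in ResNet) with a CoT block."""
    H, W, C = X.shape
    w_red = rng.standard_normal((C, C // 4)) * 0.1
    w_exp = rng.standard_normal((C // 4, C)) * 0.1
    h = np.maximum(X @ w_red, 0)          # 1x1 reduce + ReLU
    h = np.maximum(spatial_op(h), 0)      # pluggable spatial operator
    return X + h @ w_exp                  # 1x1 expand + residual connection

X = rng.standard_normal((8, 8, 32))
W3 = rng.standard_normal((3, 3, 8)) * 0.1
out = bottleneck(X, lambda h: conv3x3(h, W3))
print(out.shape)   # (8, 8, 32)
```

Because the replacement is shape-preserving, the overall stage layout, channel widths, and downsampling schedule of the backbone stay unchanged, which is what makes the swap "drop-in".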
Experiments and Visualization
Performance comparison of different ways of exploiting contextual information: static context only (Static Context), dynamic context only (Dynamic Context), linear fusion of static and dynamic contexts (Linear Fusion), and the full CoT block. The backbone is CoTNet-50, trained on ImageNet with the default settings.
Inference time vs. accuracy curves on the ImageNet dataset.
The table above summarizes the performance comparison of object detection on the COCO dataset using Faster-RCNN and Cascade-RCNN with different pre-trained backbones, grouped by network depth (50-layer/101-layer). From the results, the CoTNet models (CoTNet-50/101 and CoTNeXt-50/101) clearly outperform the ConvNet backbones (ResNet-50/101 and ResNeSt-50/101) across all IoU thresholds and object sizes at each network depth. These results essentially demonstrate the advantage of integrating self-attention learning with contextual information mining in CoTNet, even when transferred to the downstream task of object detection.
Reference post
ResNet super-variant: JD AI's newly open-sourced computer vision module! (with source code)