New hybrid architecture iFormer! Flexibly transplanting convolution and max pooling into the Transformer
2022-06-22 15:01:00 【PaperWeekly】

PaperWeekly original · author | Jason
Research direction | Computer vision
Abstract
Recent studies show that the Transformer has a strong ability to build long-range dependencies, but it performs poorly at capturing the high-frequency information that conveys local details. To solve this problem, the authors propose a new general-purpose Inception Transformer, iFormer for short, which effectively learns comprehensive features covering both high- and low-frequency information in visual data.
Specifically, the authors design an Inception mixer that transplants the advantages of convolution and max pooling for capturing high-frequency information into the Transformer. Unlike recent hybrid frameworks, the Inception mixer achieves higher efficiency through a channel-splitting mechanism, adopting parallel convolution/max-pooling paths and a self-attention path as high- and low-frequency mixers, and it can flexibly model discriminative information scattered over a wide frequency range.
Considering that lower layers play a greater role in capturing high-frequency details while higher layers matter more for modeling low-frequency global information, the authors further introduce a frequency ramp structure, which gradually shrinks the dimension fed to the high-frequency mixer and enlarges the dimension fed to the low-frequency mixer, effectively balancing high- and low-frequency components across layers.
The authors benchmark iFormer on a series of vision tasks and show its impressive performance on image classification, COCO detection and ADE20K segmentation. For example, iFormer-S achieves 83.4% top-1 accuracy on ImageNet-1K, 3.6 percentage points higher than DeiT-S and even slightly better than the much larger Swin-B (83.3%), while using only 1/4 of the parameters and 1/3 of the FLOPs.

Paper and code

Paper title:
Inception Transformer
Paper link:
https://arxiv.org/abs/2205.12956
Code:
https://github.com/sail-sg/iFormer

Motivation
The Transformer has taken natural language processing (NLP) by storm, achieving strong performance on many NLP tasks such as machine translation and question answering. This is largely due to its powerful ability to model long-range dependencies in the data via the self-attention mechanism. Its success has led researchers to investigate its adaptation to computer vision, and Vision Transformer (ViT) is the pioneering work: the architecture is directly inherited from NLP but applied to image classification with raw image patches as input. Later, many ViT variants were developed to improve performance or extend to a wider range of vision tasks, e.g., object detection and segmentation.
ViT and its variants can capture the low-frequency information in visual data, which mainly encodes the global shape and structure of a scene or object, but they are not very strong at learning high frequencies, which mainly encode local edges and textures. This can be explained intuitively: self-attention, the main operation ViTs use to exchange information among non-overlapping patch tokens, is a global operation and is far better at capturing global information (low frequency) than local information (high frequency).

As shown in Figures 1(a) and 1(b), the Fourier spectrum and the relative log amplitude of the Fourier transform indicate that ViT tends to capture low-frequency signals well but captures little high-frequency signal, showing that ViT behaves like a low-pass filter. This low-frequency preference can hurt ViT's performance, because: 1) low-frequency information filling all layers may deteriorate high-frequency components, such as local textures, and weaken ViT's modeling capability; 2) high-frequency information is also discriminative and helps with many tasks, e.g., (fine-grained) classification.
In fact, the human visual system extracts basic visual features at different frequencies: low frequencies provide global information about the visual stimulus, while high frequencies convey local spatial changes in the image (e.g., local edges/textures). It is therefore necessary to develop a new ViT structure that captures both high and low frequencies in visual data.
CNNs are the most fundamental backbone for general vision tasks. Unlike ViTs, they cover local information through local convolutions within the receptive field and can therefore extract high-frequency representations effectively. Recent works have considered the complementary advantages of CNNs and ViTs and combined them. Some methods stack convolution and attention layers serially to inject local information into the global context.
Unfortunately, this serial approach only models one type of dependency (global or local) in a given layer and discards global information during local modeling, and vice versa. Other works use parallel attention and convolution to learn global and local dependencies of the input simultaneously. However, one part of the channels handles local information while the other part handles global modeling, which means that if every branch processes all channels, current parallel structures carry redundant information.
To solve this problem, the authors propose a simple yet effective Inception Transformer (iFormer), which transplants CNNs' advantage in capturing high frequencies into ViT. The key component of iFormer is the Inception token mixer, which aims to enhance ViT's perception of the frequency spectrum by capturing both high and low frequencies in the data.
To this end, the Inception mixer first splits the input feature along the channel dimension and then feeds the split components into a high-frequency mixer and a low-frequency mixer, respectively. Here, the high-frequency mixer consists of a max-pooling operation and a parallel convolution operation, while the low-frequency mixer is the self-attention of ViTs. In this way, the proposed iFormer can effectively capture frequency-specific information on the corresponding channels and thus learns more comprehensive features over a wide frequency range than ViT.
Besides, the authors observe that lower layers usually need more local information while higher layers need more global information. This is because, like the human visual system, the details carried by high-frequency components help lower layers capture basic visual features, and local information is then gradually aggregated into a global understanding of the input. Inspired by this, the authors design a frequency ramp structure: from lower to higher layers, they gradually feed a larger channel dimension to the low-frequency mixer and a smaller one to the high-frequency mixer. This structure balances high- and low-frequency components across all layers.
Experimental results show that iFormer outperforms SOTA ViTs and CNNs on image classification, object detection and segmentation. For example, as shown in Figure 1(c), iFormer delivers consistent performance improvements over popular frameworks such as DeiT, Swin and ConvNeXt on ImageNet-1K across different model sizes. Meanwhile, iFormer also surpasses recent frameworks on COCO detection and ADE20K segmentation.

Method
3.1 Revisit Vision Transformer
For vision tasks, Transformers first split the input image into a sequence of patch tokens, and each patch token is projected into a compact hidden representation vector; the resulting sequence is denoted $X \in \mathbb{R}^{N \times C}$, where $N$ is the number of patch tokens and $C$ is the feature dimension. All tokens are then combined with position embeddings and fed into Transformer layers consisting of multi-head self-attention (MSA) and a feed-forward network (FFN).
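To make the shapes concrete, here is a minimal PyTorch sketch of this tokenization and of one pre-norm Transformer layer (MSA followed by FFN, each with a residual connection); the module names and hyperparameters are illustrative choices for this post, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    """A standard pre-norm Transformer layer: X -> X + MSA(LN(X)), then + FFN(LN(.))."""
    def __init__(self, dim=384, num_heads=6, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                          # x: (B, N, C) patch tokens
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)                  # global token mixing (MSA)
        x = x + a
        x = x + self.ffn(self.norm2(x))            # position-wise FFN
        return x

# Patchify a 224x224 image into N = (224/16)^2 = 196 tokens of dimension C = 384
patch_embed = nn.Conv2d(3, 384, kernel_size=16, stride=16)
img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)   # (1, 196, 384)
print(ViTBlock()(tokens).shape)                         # torch.Size([1, 196, 384])
```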

In MSA, the attention-based mixer exchanges information among all patch tokens, so it focuses on aggregating global dependencies in every layer. However, such pervasive propagation of global information strengthens the low-frequency representation.
The visualization of the Fourier spectrum in Figure 1(a) shows that low-frequency information dominates the ViT representation. This actually hurts ViT's performance, because it may deteriorate high-frequency components such as local textures and weaken ViT's modeling capability. In visual data, high-frequency information is also discriminative and helps with many tasks. To address this issue, the authors propose the simple yet effective Inception Transformer shown in the figure above, with two key innovations: the Inception mixer and the frequency ramp structure.
3.2 Inception token mixer

The authors propose an Inception mixer that transplants the CNN's powerful ability to extract high-frequency representations into the Transformer; the detailed architecture is shown in the figure above. Instead of feeding the image tokens only into the MSA mixer, the Inception mixer first splits the input feature along the channel dimension and then feeds the split components into the high-frequency mixer and the low-frequency mixer, respectively. Here, the high-frequency mixer consists of a max-pooling operation and a parallel convolution operation, while the low-frequency mixer is realized by self-attention.
Technically, given the input feature map $X \in \mathbb{R}^{N \times C}$, it is decomposed along the channel dimension into $X_h \in \mathbb{R}^{N \times C_h}$ and $X_l \in \mathbb{R}^{N \times C_l}$, where $C_h + C_l = C$. Then $X_h$ and $X_l$ are assigned to the high-frequency mixer and the low-frequency mixer, respectively.
3.2.1 High-frequency mixer
Considering the sharpness sensitivity of the max filter and the detail perception of convolution, the authors propose a parallel structure to learn high-frequency components. $X_h$ is further divided along the channel dimension into $X_{h1}$ and $X_{h2}$. As shown in the figure above, $X_{h1}$ is fed into a max-pooling operation followed by a linear (embedding) layer, while $X_{h2}$ is fed into a linear layer followed by a depthwise convolution:

$$Y_{h1} = \mathrm{FC}(\mathrm{MaxPool}(X_{h1})), \qquad Y_{h2} = \mathrm{DwConv}(\mathrm{FC}(X_{h2})),$$

where $Y_{h1}$ and $Y_{h2}$ denote the outputs of the high-frequency mixer.
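Below is a rough PyTorch sketch of this high-frequency branch. It assumes the high-frequency channels $X_h$ have been reshaped into a (B, C_h, H, W) layout so that max pooling and depthwise convolution can act spatially; the class and parameter names are illustrative, not the official implementation.

```python
import torch
import torch.nn as nn

class HighFreqMixer(nn.Module):
    """Parallel max-pooling and depthwise-convolution paths for the high-frequency channels X_h."""
    def __init__(self, dim_h):
        super().__init__()
        self.half = dim_h // 2
        # Branch 1: Y_h1 = FC(MaxPool(X_h1))
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Conv2d(self.half, self.half, kernel_size=1)             # channel-wise linear layer
        # Branch 2: Y_h2 = DwConv(FC(X_h2))
        rest = dim_h - self.half
        self.fc2 = nn.Conv2d(rest, rest, kernel_size=1)
        self.dwconv = nn.Conv2d(rest, rest, kernel_size=3, padding=1, groups=rest)  # depthwise conv

    def forward(self, x_h):                                   # x_h: (B, C_h, H, W)
        x_h1, x_h2 = x_h[:, :self.half], x_h[:, self.half:]   # split X_h along channels
        y_h1 = self.fc1(self.pool(x_h1))                      # max-pooling path
        y_h2 = self.dwconv(self.fc2(x_h2))                    # convolution path
        return torch.cat([y_h1, y_h2], dim=1)                 # (B, C_h, H, W)

# Example: 32 high-frequency channels on a 14x14 feature map
print(HighFreqMixer(32)(torch.randn(1, 32, 14, 14)).shape)    # torch.Size([1, 32, 14, 14])
```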
Finally, the outputs of the high-frequency mixer and the low-frequency mixer ($Y_l$, defined in the next subsection) are concatenated along the channel dimension:

$$Y_c = \mathrm{Concat}(Y_{h1}, Y_{h2}, Y_l).$$

The authors also design a fusion module that exchanges information across patches through a depthwise convolution and adds a cross-channel linear layer that operates at every position, as in the vanilla Transformer. The final output can be expressed as:

$$Y = \mathrm{FC}(Y_c + \mathrm{DwConv}(Y_c)).$$
Like the vanilla Transformer, the iFormer in this paper is equipped with a feed-forward network (FFN); the difference is that it also integrates the Inception token mixer (ITM) described above, and LayerNorm (LN) is applied before both ITM and FFN. The Inception Transformer block is therefore formulated as:

$$X' = X + \mathrm{ITM}(\mathrm{LN}(X)), \qquad X'' = X' + \mathrm{FFN}(\mathrm{LN}(X')).$$
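A hedged sketch of how the fusion step and the block composition above might fit together in PyTorch is given below; `Fusion` implements $Y = \mathrm{FC}(Y_c + \mathrm{DwConv}(Y_c))$, and `InceptionBlock` wraps an arbitrary token mixer with the pre-norm residual structure. The interfaces and reshaping conventions here are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class Fusion(nn.Module):
    """Y = FC(Y_c + DwConv(Y_c)): depthwise conv exchanges information across patches,
    then a 1x1 cross-channel linear layer mixes channels at every position."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.fc = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, y_c):                      # y_c: (B, C, H, W), concat of Y_h1, Y_h2, Y_l
        return self.fc(y_c + self.dwconv(y_c))

class InceptionBlock(nn.Module):
    """Pre-norm block: X' = X + ITM(LN(X)); X'' = X' + FFN(LN(X'))."""
    def __init__(self, dim, token_mixer, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.itm = token_mixer                   # Inception token mixer, maps (B, N, C) -> (B, N, C)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                        # x: (B, N, C) tokens
        x = x + self.itm(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x

# Smoke test with a trivial mixer standing in for the ITM
block = InceptionBlock(dim=96, token_mixer=nn.Identity())
print(block(torch.randn(1, 196, 96)).shape)      # torch.Size([1, 196, 96])
```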
3.2.2 Low-frequency mixer
The authors use vanilla multi-head self-attention in the low-frequency mixer to pass messages among all tokens. Although attention is strong at learning global representations, the large feature-map resolution in the lower layers makes it computationally expensive. Therefore, an average pooling layer is used to reduce the spatial scale of $X_l$ before the attention operation, and an upsampling layer restores the original spatial dimension after it. This design greatly reduces the computational overhead and lets attention focus on embedding global information. The branch can be defined as:

$$Y_l = \mathrm{Upsample}(\mathrm{MSA}(\mathrm{AvePool}(X_l))),$$

where $Y_l$ is the output of the low-frequency mixer. The kernel size and stride of the pooling and upsampling layers are set to 2 only in the first two stages.
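A minimal PyTorch sketch of this branch follows, under the assumption that $X_l$ is held in a (B, C_l, H, W) layout, H and W are divisible by the pooling size, and $C_l$ is divisible by the number of heads; names are illustrative, not the official code.

```python
import torch
import torch.nn as nn

class LowFreqMixer(nn.Module):
    """Y_l = Upsample(MSA(AvgPool(X_l))): run attention on a downsampled map to cut cost."""
    def __init__(self, dim_l, num_heads=4, pool_size=2):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size) if pool_size > 1 else nn.Identity()
        self.up = nn.Upsample(scale_factor=pool_size) if pool_size > 1 else nn.Identity()
        self.attn = nn.MultiheadAttention(dim_l, num_heads, batch_first=True)

    def forward(self, x_l):                          # x_l: (B, C_l, H, W)
        b, c, _, _ = x_l.shape
        x = self.pool(x_l)                           # shrink the spatial scale before attention
        hp, wp = x.shape[2], x.shape[3]
        tokens = x.flatten(2).transpose(1, 2)        # (B, Hp*Wp, C_l)
        y, _ = self.attn(tokens, tokens, tokens)     # vanilla multi-head self-attention
        y = y.transpose(1, 2).reshape(b, c, hp, wp)
        return self.up(y)                            # restore the original spatial resolution

# Example: 64 low-frequency channels on a 28x28 feature map
print(LowFreqMixer(64)(torch.randn(1, 64, 28, 28)).shape)      # torch.Size([1, 64, 28, 28])
```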
3.3 Frequency ramp structure
In a general vision framework, lower layers play a more important role in capturing high-frequency details, while higher layers matter more for modeling low-frequency global information. As in human vision, by capturing the details in high-frequency components, lower layers can extract basic visual features and gradually aggregate local information toward a global understanding of the input. Inspired by this, a frequency ramp structure is designed: from lower to higher layers, it gradually allocates a larger channel dimension to the low-frequency mixer and reserves a smaller channel dimension for the high-frequency mixer.
Specifically, as shown in Figure 2, the backbone has four stages with different channel and spatial dimensions. For each block, the authors define channel ratios, namely $C_h/C$ and $C_l/C$ with $C_h + C_l = C$, to better balance the high- and low-frequency components. In the proposed frequency ramp structure, $C_h/C$ gradually decreases and $C_l/C$ gradually increases from shallow to deep layers. Thanks to this flexible frequency ramp structure, iFormer can effectively trade off high- and low-frequency components across all layers.
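To illustrate what such a ramp could look like in code (the channel widths and ratios below are made-up placeholders, not the values used in the paper):

```python
# Hypothetical frequency ramp: the share of channels given to the low-frequency (attention)
# mixer grows with depth, while the high-frequency (max-pool/conv) share shrinks.
stage_dims      = [96, 192, 320, 384]        # illustrative channel widths per stage
low_freq_ratios = [0.25, 0.40, 0.65, 0.85]   # C_l / C, increasing from shallow to deep

for stage, (dim, r_low) in enumerate(zip(stage_dims, low_freq_ratios), start=1):
    c_low = int(dim * r_low)                 # channels routed to self-attention
    c_high = dim - c_low                     # channels routed to max-pool / convolution branches
    print(f"stage {stage}: C={dim}, C_l={c_low}, C_h={c_high}")
```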

Experiments

The table above summarizes the image classification accuracy of all compared methods on ImageNet. For the smaller model size (~20M), the proposed iFormer surpasses SOTA ViTs and hybrid ViTs. Specifically, compared with the SOTA ViT (i.e., CSwin-T) and hybrid ViT (i.e., UniFormer-S), iFormer-S has a 0.7% and 0.5% top-1 accuracy advantage, respectively, with the same or smaller model size.

The table above reports fine-tuning accuracy at a larger resolution (i.e., 384×384). It can be observed that iFormer consistently performs better than its counterparts under different computational settings. These results clearly demonstrate the advantages of iFormer in image classification.

The table above reports the box mAP and mask mAP of the compared models. Under similar computational budgets, iFormers perform better than all previous backbones.

In the table above, the authors report mIoU results. With the Semantic FPN framework, the proposed iFormer consistently outperforms previous backbones on this task, including CNNs and (hybrid) ViTs.

In the table above, it can be seen that combining attention with convolution and max pooling achieves better accuracy than the attention-only mixer while using less computation, demonstrating the effectiveness of the Inception token mixer. In addition, the results in the lower part of the table verify the rationality of the frequency ramp structure and its potential for learning discriminative visual representations.

To further probe the scheme, the figure above shows the Fourier spectra of the attention, MaxPool and DwConv branches of the Inception mixer.

In the figure above, the authors visualize Grad-CAM activation maps of iFormer-S and Swin-T models trained on ImageNet-1K. It can be seen that, compared with Swin, iFormer localizes objects more accurately and more completely. For example, in the hummingbird image, iFormer skips over the branches and accurately attends to the whole bird, including the tail.

Summary
In this paper, the authors propose the Inception Transformer (iFormer), a new general-purpose Transformer backbone. iFormer adopts a channel-splitting mechanism to couple convolution/max pooling with self-attention simply and effectively, making it attend more to high-frequency information and broadening the Transformer's perception of the frequency spectrum. On top of the flexible Inception token mixer, the authors further design a frequency ramp structure that effectively trades off high- and low-frequency components across all layers. Extensive experiments show that iFormer outperforms representative vision Transformers on image classification, object detection and semantic segmentation, demonstrating its great potential as a general backbone for computer vision.
One obvious limitation of iFormer is that the channel ratios $C_h/C$ and $C_l/C$ in the frequency ramp structure must be manually defined for each iFormer block, which requires considerable experience to set well for different tasks.