【NeurIPS】ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

2022-07-16 06:25:00 【AI frontier theory group @ouc】

Paper: https://openreview.net/forum?id=_WnAQKse_uK
Code: https://github.com/Annbless/ViTAE

1. Motivation

The idea of this paper is simple: combine CNN and ViT, using convolutions in the shallow layers and ViT in the deep layers. In addition, a convolution branch is added alongside the attention branch.

2. Method

The overall network architecture is shown in the figure below. It consists of three Reduction Cells (RC) and several Normal Cells (NC).

[Figure: overall ViTAE network architecture]

RC module

Compared with ViT's Transformer block, the RC adds a pyramid reduction step: several dilated convolutions with different dilation rates run in parallel, and their outputs are concatenated into a single feature map. The shortcut path also gains three convolutions. Finally, there is an extra seq2img step that converts the token sequence back into a feature map.
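The parallel dilated convolutions can only be concatenated if every branch produces a map of the same spatial size. The sketch below checks that with plain shape arithmetic. The concrete numbers (224-pixel input, kernel size 7, stride 4, dilation rates 1 to 4, 16 channels per branch) and the padding rule are assumptions chosen for illustration; they are not taken from this article, and the function names are hypothetical.

```python
import math


def dilated_conv_out_size(size, kernel, stride, dilation, padding):
    """Output length of a dilated convolution along one spatial axis."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def aligned_padding(kernel, stride, dilation):
    """Assumed padding rule that lets every branch reduce the input by `stride`."""
    effective_kernel = dilation * (kernel - 1) + 1
    return math.ceil((effective_kernel - stride) / 2)


def pyramid_reduction_shape(size, kernel, stride, dilations, branch_channels):
    """Return (spatial size, channels) after concatenating all dilated branches."""
    sizes = []
    for d in dilations:
        p = aligned_padding(kernel, stride, d)
        sizes.append(dilated_conv_out_size(size, kernel, stride, d, p))
    # Channel-wise concatenation requires identical spatial sizes.
    if len(set(sizes)) != 1:
        raise ValueError(f"branches disagree on output size: {sizes}")
    return sizes[0], branch_channels * len(dilations)


# Illustrative values only (assumptions, not from the article).
out_size, out_channels = pyramid_reduction_shape(
    size=224, kernel=7, stride=4, dilations=[1, 2, 3, 4], branch_channels=16
)
print(out_size, out_channels)  # prints: 56 64
```

Each branch sees a different receptive field (the effective kernel grows with the dilation rate), yet all of them land on the same 56 x 56 grid, so the concatenation simply stacks their channels.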

NC module

The difference from ViT's Transformer block is that an extra convolution branch runs alongside the attention computation.
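One way to write this down is as a residual update with two parallel branches. This is a hedged sketch rather than the paper's exact formulation: the symbols, the placement of layer normalization, and the choice to sum the two branch outputs are assumptions made for illustration. Here x is the token sequence, MHSA is multi-head self-attention, LN is layer normalization, FFN is the feed-forward network, and seq2img / img2seq reshape between a token sequence and a 2D feature map.

```latex
\begin{aligned}
x' &= x + \mathrm{MHSA}\big(\mathrm{LN}(x)\big)
        + \mathrm{img2seq}\big(\mathrm{Conv}(\mathrm{seq2img}(x))\big), \\
y  &= x' + \mathrm{FFN}\big(\mathrm{LN}(x')\big).
\end{aligned}
```

Dropping the Conv term recovers the standard pre-norm Transformer block, which makes it clear that the convolution branch is the only structural addition in the NC.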

3. Points of interest

Judging from the reviews on OpenReview, the strong points recognized by the reviewers are:

The idea of injecting multi-scale features is interesting and promising.
The paper is well written and easy to follow.

At the same time, the reviewers also pointed out some weaknesses of the paper:

The paper use an additional conv branch together with the self-attention branch to construct the new network architecture, it is obvious that the extra conv layers will help to improve the performance of the network. The proposed network modification looks a little bit incremental and not very interesting to me.
There are no results on the downstream object detection and segmentation tasks, since this paper aims to introduce the inductive bias on the visual structure.
The proposed method is mainly verified on small input images. Thus, I am a little bit concerned about its memory consumption and running speed when applied on large images (as segmentation or detection typically uses large image resolutions).