New paradigm of semantic segmentation! StructToken: Rethinking the per-pixel classification paradigm

2022-06-22 23:53:00 【Zhiyuan community】

Paper link: https://arxiv.org/abs/2203.12612

This paper reflects on and improves the traditional per-pixel classification paradigm of semantic segmentation.

Previous work treats semantic segmentation as a per-pixel classification task. The mainstream paradigm is an encoder-decoder structure: after the encoder and decoder learn per-pixel representations, each pixel is classified into a category to obtain the predicted semantic mask. This paper takes a different strategy: it uses structural information as a prior to construct a semantic mask directly and then refines it step by step, instead of following the per-pixel classification paradigm.

Specifically, for a given input image, learnable structure tokens in the model interact with the image representation, and the final semantic mask is derived from them. This idea closely resembles the behavior of the cls token in the original ViT. Since the implementation is based on ViT-L, one can intuitively read StructToken as a "direct" transfer of the original Transformer form to dense prediction tasks such as semantic segmentation, unlike other work that is heavily influenced by the design paradigm that convolutional neural networks established for the target task.

This raises several questions worth considering:

What is the structural information defined in this paper?
How does the proposed design express this structural information?
How can we verify that the improvement brought by these designs is related to the so-called structural information?

Related work

A great deal of work has been done on semantic segmentation, but current methods can basically all be classified under the per-pixel classification paradigm. They differ mainly in whether the classifier parameters are dynamic:

Static per-pixel classification: This continues the paradigm of classical segmentation work based on convolutional neural networks, such as FCN. These methods mainly enhance contextual semantic information and fuse multi-scale features to obtain more effective image feature representations, then use an independent static classifier (typically a 1x1 convolution) to predict the semantic category of each pixel. This line of work focuses on improving the expressiveness of per-pixel features, but does not consider the structural information in the image in the model design.
Dynamic per-pixel classification: Recent work has introduced the idea of dynamic structures. Besides the feature-processing structure of the model itself, classifiers have also started to take dynamic forms. The typical works listed in the paper are Segmenter [Segmenter: Transformer for semantic segmentation], MaskFormer [Per-pixel classification is not all you need for semantic segmentation], Mask2Former [Masked-attention mask transformer for universal image segmentation] and K-Net [K-Net: Towards unified image segmentation]. They mainly use a set of learnable tokens associated with semantic categories, which interact with the image features to produce the final mask prediction. In form, this can be regarded as a dynamic classification process. These methods do not completely abandon the classical paradigm described above, and they achieve better overall performance. From this paper's perspective, however, they still have not left the per-pixel classification paradigm :).

Overall, these works all learn a linear discriminant function for each class, implemented either as a static convolution or as a dynamic matrix multiplication. The function is applied to the per-pixel feature representation to assign each pixel its most relevant semantic category.
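To make the shared form of the two paradigms concrete, here is a minimal NumPy sketch. It is not the code of any of the cited methods: the sizes are toy values, and the "dynamic" class tokens are random stand-ins for tokens that a real model would first update by interacting with the image. It only illustrates that both variants score every pixel with one linear function per class.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, H, W = 8, 3, 4, 5                    # channels, classes, feature-map size (toy values)
feats = rng.standard_normal((C, H, W))     # per-pixel features from the encoder/decoder

# Static classifier: a fixed [K, C] weight matrix, equivalent to a 1x1 convolution.
static_weights = rng.standard_normal((K, C))
static_scores = np.einsum("kc,chw->khw", static_weights, feats)

# Dynamic classifier: class tokens take the place of the fixed weights.
# Random stand-ins here; in a real model they would depend on the input image.
class_tokens = rng.standard_normal((K, C))
dynamic_scores = np.einsum("kc,chw->khw", class_tokens, feats)

# In both cases each pixel is assigned the class with the highest linear score.
static_mask = static_scores.argmax(axis=0)
dynamic_mask = dynamic_scores.argmax(axis=0)
print(static_mask.shape, dynamic_mask.shape)
```

The only difference between the two branches is where the [K, C] matrix comes from, which is why the article groups both under per-pixel classification.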

The authors argue that when people recognize objects, they first capture the structural information of a semantic category (shape and so on) and then attend to internal details. To segment regions of different semantic categories in an image, one usually first generates a rough mask according to structure and then adjusts the details of the mask. Neither of the two existing per-pixel classification paradigms fully reflects this process. Instead, they directly classify the pixels on the feature map of the model's penultimate layer to obtain the score map. This encourages the network to optimize the representation of individual pixels, while the most important structural features are neglected or even destroyed.

To address this, the authors propose a structural prior paradigm: the score map used to segment the image is constructed directly from structure tokens and then refined step by step.

Main content

The paper mainly studies how the structure tokens can extract useful information from the feature map. The proposed architecture follows this overall process:

Feature extraction: A Transformer backbone such as ViT extracts the feature map F, of size [C,H/16,W/16].
Structure tokens: Learnable structure tokens S are randomly initialized, with size [K,N], where K is the number of categories in the dataset and N is the number of patches, i.e. the [H/16,W/16] patch grid flattened.
Information interaction: An interaction structure processes S, capturing the structural information in the feature map and producing a rough mask according to the learned prior.
Feature refinement: Independent FFNs refine the structure tokens and process the feature map.
Cascade processing: Multiple basic units (each containing interaction and refinement) are stacked to process the features repeatedly.
Prediction: At the tail, a convolution block composed of two convolution layers and a skip connection refines the final segmentation mask and produces the final result.
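The steps above can be followed at the level of tensor shapes with a small NumPy skeleton. This is only a sketch under stated assumptions, not the authors' implementation: `interact` and `refine` are hypothetical stand-ins (a channel mix and a residual two-layer MLP), the weights are random, the depth `num_blocks` is assumed, and the final convolution block is omitted. What it shows is that S already has the shape of a per-class mask from the start and is refined in place, rather than being produced by a classifier at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 16, 4                 # feature channels and number of classes (toy values)
H, W = 64, 96                # input image size (toy values)
h, w = H // 16, W // 16      # patch grid for 16x16 patches
N = h * w                    # number of patches

F = rng.standard_normal((C, N))   # backbone features, [C, h, w] flattened to [C, N]
S = rng.standard_normal((K, N))   # structure tokens, randomly initialised

def interact(S, F, mix):
    # Hypothetical stand-in for the interaction module: each class map absorbs
    # a weighted combination of feature channels. `mix` has shape [K, C].
    return S + mix @ F

def refine(X, W1, W2):
    # Hypothetical stand-in for an FFN: residual two-layer MLP along the patch axis.
    return X + np.maximum(X @ W1, 0.0) @ W2

def random_ffn_weights():
    return (rng.standard_normal((N, 2 * N)) * 0.1,
            rng.standard_normal((2 * N, N)) * 0.1)

num_blocks = 3  # assumed depth of the cascade
for _ in range(num_blocks):
    mix = rng.standard_normal((K, C)) * 0.1
    S = interact(S, F, mix)
    S = refine(S, *random_ffn_weights())   # FFN for the structure tokens
    F = refine(F, *random_ffn_weights())   # separate FFN for the feature map

score_map = S.reshape(K, h, w)          # one mask per class on the patch grid
prediction = score_map.argmax(axis=0)   # class label per patch location
print(score_map.shape, prediction.shape)
```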

Among these steps, the design of the interaction structure is the core of the paper. The paper explores three interaction structures: two dynamic and one static.

Dynamic structures: These are based on the idea of attention, but the tokens involved in the computation are channels rather than spatial patches; that is, the interaction is between the category tokens in S and the feature channels in F.
- The first, CSE, follows the cross-attention paradigm. After linear transformations, S generates the Query and F generates the Key and Value, which are fed into cross-attention. The result has the same shape as S. According to the figure, there is also a split operation, but the paper does not specify how it is implemented.
- The second, SSE, follows the self-attention paradigm. S and F are concatenated along the channel dimension, linearly transformed into Query, Key and Value, and fed into self-attention. The result is split back according to the original channel counts of S and F.
Static structure: A 1x1 convolution directly replaces the computation of the attention similarity in the SSE module. The convolution output corresponds to the pre-split result in SSE. The 1x1 convolution directly mixes information from the different input channels, implementing a process similar to SSE.
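The three interaction structures can be sketched in NumPy as follows. This is an illustrative reading of the description above, not the paper's code. Several details are assumptions: plain [N, N] linear maps stand in for the projection layers, the attention uses a single head with 1/sqrt(N) scaling, and the function names `cse`, `sse` and `static_mix` are chosen here for illustration. The split step of CSE that the paper leaves unspecified is not modelled.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cse(S, F, Wq, Wk, Wv):
    # Cross-attention over channels: S gives the queries, F the keys and values.
    Q, Kf, V = S @ Wq, F @ Wk, F @ Wv                          # [K, N], [C, N], [C, N]
    attn = softmax(Q @ Kf.T / np.sqrt(Q.shape[1]), axis=-1)    # [K, C]
    return attn @ V                                            # [K, N], same shape as S

def sse(S, F, Wq, Wk, Wv):
    # Self-attention over the channel-wise concatenation of S and F.
    X = np.concatenate([S, F], axis=0)                         # [K + C, N]
    Q, Kx, V = X @ Wq, X @ Wk, X @ Wv
    attn = softmax(Q @ Kx.T / np.sqrt(Q.shape[1]), axis=-1)    # [K + C, K + C]
    out = attn @ V
    return out[:S.shape[0]], out[S.shape[0]:]                  # split back into S and F parts

def static_mix(S, F, Wmix):
    # Static variant: a fixed [K + C, K + C] matrix (a 1x1 convolution over
    # channels) takes the place of the input-dependent attention map.
    X = np.concatenate([S, F], axis=0)
    out = Wmix @ X
    return out[:S.shape[0]], out[S.shape[0]:]

rng = np.random.default_rng(0)
K, C, N = 4, 16, 24                                            # toy sizes
S = rng.standard_normal((K, N))
F = rng.standard_normal((C, N))
Wq, Wk, Wv = (rng.standard_normal((N, N)) * 0.1 for _ in range(3))
Wmix = rng.standard_normal((K + C, K + C)) * 0.1

S_cse = cse(S, F, Wq, Wk, Wv)
S_sse, F_sse = sse(S, F, Wq, Wk, Wv)
S_st, F_st = static_mix(S, F, Wmix)
print(S_cse.shape, S_sse.shape, F_sse.shape, S_st.shape, F_st.shape)
```

Note that every attention map here has channels on both axes ([K, C] or [K + C, K + C]); no patch-to-patch similarity is computed, which is the point of the channel-based design.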

In all of the structures above, S and F are passed through a projection layer before the attention operation. Although the attention is computed over channels, the projection layer takes the form 1x1 convolution + 3x3 depthwise convolution + 1x1 convolution, which is still an operation shared across the spatial dimensions.
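A projection of this form can be written out in NumPy as a rough sketch. The hidden width, zero padding, random weights and the absence of biases, normalization and activations are all assumptions made here for brevity; the paper's actual layer may differ. The sketch shows that the same small kernels are applied at every spatial location, which is what "shared across the spatial dimensions" means.

```python
import numpy as np

def conv1x1(x, weight):
    # x: [C_in, h, w], weight: [C_out, C_in]; mixes channels at each location.
    return np.einsum("oc,chw->ohw", weight, x)

def depthwise_conv3x3(x, kernels):
    # x: [C, h, w], kernels: [C, 3, 3]; each channel is filtered on its own,
    # with zero padding so that the spatial size is preserved.
    C, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out

def project(x, w_in, dw, w_out):
    # 1x1 convolution -> 3x3 depthwise convolution -> 1x1 convolution.
    return conv1x1(depthwise_conv3x3(conv1x1(x, w_in), dw), w_out)

rng = np.random.default_rng(0)
K, h, w, hidden = 4, 4, 6, 8                     # toy sizes; `hidden` is an assumed width
S = rng.standard_normal((K, h, w))               # structure tokens on the patch grid
w_in = rng.standard_normal((hidden, K)) * 0.1
dw = rng.standard_normal((hidden, 3, 3)) * 0.1
w_out = rng.standard_normal((K, hidden)) * 0.1

projected = project(S, w_in, dw, w_out)
print(projected.shape)
```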

The two outputs of these modules are each followed by an FFN. The FFN here uses an FC + 3x3 grouped convolution + FC structure, so it can refine local features and can also be regarded as an implicit positional encoding.


Copyright notice
This article was written by [Zhiyuan community]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/173/202206222132485893.html
