A new paradigm for semantic segmentation! StructToken: Rethinking the per-pixel classification paradigm
2022-06-22 23:53:00 【Zhiyuan community】

Paper link: https://arxiv.org/abs/2203.12612
This paper reflects on and improves the traditional per-pixel classification paradigm of semantic segmentation.
Previous work on semantic segmentation treats it as a per-pixel classification task. The mainstream paradigm is an encoder-decoder structure: after the encoder and decoder learn per-pixel representations, each pixel is classified into a category to produce the predicted semantic mask. This paper takes a different strategy: it uses structural information as a prior to construct a semantic mask directly and then refines it progressively, instead of following the per-pixel classification paradigm.
Specifically, for a given input image, learnable structure tokens in the model interact with the image representation, and the final semantic mask is derived from them. This idea closely resembles the behavior of the cls token in the original ViT. Given that the implementation is based on ViT-L, one can intuitively read StructToken as a "direct" transfer of the original Transformer form to dense prediction tasks such as semantic segmentation, unlike other work whose design for the target task is heavily influenced by the original convolutional neural network paradigm.
This raises several questions worth considering:
- What is the structural information defined in the paper?
- How does the proposed design express this structural information?
- How can one verify that the improvements brought by these designs are actually related to the so-called structural information?
Related work

A great deal of work has been done on semantic segmentation, but almost all of it falls under the per-pixel classification paradigm. The main difference lies in whether the classifier parameters are dynamic:
- Static per-pixel classification: This continues the paradigm of classic CNN-based segmentation work such as FCN. These methods mainly enhance contextual semantic information and fuse multi-scale features to obtain more effective image feature representations, then use an independent static classifier (typically a 1x1 convolution) to predict the semantic category of each pixel. This line of work focuses on improving the expressiveness of per-pixel features but does not consider the structural information of the image in the model design.
- Dynamic per-pixel classification: Recent work has introduced the idea of dynamic structure. Besides the feature-processing structure of the model itself, the classifier also becomes dynamic. The typical works listed in the paper are Segmenter [Segmenter: Transformer for semantic segmentation], MaskFormer [Per-pixel classification is not all you need for semantic segmentation], Mask2Former [Masked-attention mask transformer for universal image segmentation] and K-Net [K-Net: Towards unified image segmentation]. They mainly use a set of learnable tokens associated with semantic categories, which interact with the image features to produce the final mask prediction. In form, this can be regarded as a dynamic classification process. These methods do not completely abandon the classic paradigm above, and they achieve better overall performance. From the perspective of this paper, however, they still have not abandoned per-pixel classification :).
Overall, these works all learn a linear discriminant function for each class, either as a static convolution or as a dynamic matrix multiplication. This function is applied to the per-pixel feature representation to assign each pixel its most relevant semantic category.
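To make the point about linear discriminant functions concrete, here is a minimal NumPy sketch of the two variants. The sizes, the function names and the way the "dynamic" class tokens are produced are all hypothetical stand-ins, not taken from any of the cited methods; the only claim is that both variants end in a per-class linear map followed by an independent argmax at every pixel.

```python
import numpy as np

def static_scores(feat, weight):
    """Static classifier: a 1x1 convolution, i.e. one fixed (K, C) linear map per pixel."""
    return np.einsum("kc,chw->khw", weight, feat)

def dynamic_scores(feat, class_tokens):
    """Dynamic classifier: image-dependent class tokens (K, C) multiplied with pixel features."""
    return np.einsum("kc,chw->khw", class_tokens, feat)

rng = np.random.default_rng(0)
C, H, W, K = 8, 4, 4, 3                      # hypothetical channel, spatial and class sizes
feat = rng.standard_normal((C, H, W))        # per-pixel features from encoder and decoder
weight = rng.standard_normal((K, C))         # fixed after training in the static case

# Toy stand-in for token/feature interaction: the tokens depend on the input image.
class_tokens = weight + 0.1 * feat.mean(axis=(1, 2))

# In both cases each pixel is labeled independently of its neighbors.
mask_static = static_scores(feat, weight).argmax(axis=0)         # (H, W)
mask_dynamic = dynamic_scores(feat, class_tokens).argmax(axis=0)  # (H, W)
```

The two functions are deliberately identical in form: the difference between the paradigms is only where the (K, C) matrix comes from.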
The authors argue that when people recognize objects, they first capture the structural information of a semantic category (shape, etc.) and then attend to the internal details. To segment regions of different semantic categories in an image, one usually first generates a rough mask from the structure and then adjusts the details of that mask. Neither of the two existing per-pixel classification paradigms fully reflects this process; rather, they directly classify the pixels of the penultimate-layer feature map to obtain the score map. This encourages the network to optimize the representation of individual pixels, while the most important structural features are neglected or even destroyed.
To address this, the authors propose a structure-prior paradigm: the score map is constructed directly from structure tokens to segment the image and is then progressively refined.
Main content

The paper mainly studies how to extract useful information from the feature map according to the structure tokens. The proposed architecture follows this overall process:
- Feature extraction: A Transformer backbone, such as ViT, extracts the feature map F of size [C,H/16,W/16].
- Structure tokens: Learnable structure tokens S are randomly initialized with size [K,N], where K is the number of categories in the dataset and N is the number of patches, i.e. the [H/16,W/16] grid.
- Information interaction: An interaction structure processes S, capturing the structural information in the feature map and producing a rough mask according to the learned prior.
- Feature refinement: Independent FFNs refine the structure tokens and process the feature map.
- Cascaded processing: Multiple basic units (each including interaction and refinement) are stacked to process the features repeatedly.
- Prediction: At the tail, a convolution block consisting of two convolution layers and a skip connection refines the final segmentation mask and yields the final result.
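As a quick check on the tensor sizes in the steps above, the following sketch computes the shapes of F and S for a given input. The function name and the example values (a 512x512 input, C = 1024, K = 150) are assumptions chosen for illustration; the text itself only fixes the patch stride of 16.

```python
def structtoken_shapes(height, width, channels, num_classes, patch=16):
    """Return the shapes of the feature map F and the structure tokens S.

    F has one C-dimensional vector per patch; S has one row per category,
    each row covering all N = (H/16) * (W/16) patches.
    """
    h, w = height // patch, width // patch
    return {"F": (channels, h, w), "S": (num_classes, h * w)}

# Hypothetical setting: 512x512 image, 1024 channels, 150 categories.
shapes = structtoken_shapes(512, 512, channels=1024, num_classes=150)
print(shapes)  # {'F': (1024, 32, 32), 'S': (150, 1024)}
```

Note that each row of S already has the spatial layout of a mask, which is what lets it be read as a rough per-category score map.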
Among these steps, the design of the interaction structure is the core of the paper. The paper explores three interaction structures: two dynamic and one static.
- Dynamic structures: These are based on the idea of attention, but the tokens involved in the computation are not spatial patches. The attention is channel-based, i.e. it operates between the category tokens in S and the feature channels in F.
  - The first, CSE, follows the cross-attention paradigm. After linear transformations, S generates the Query and F generates the Key and Value, which are fed into cross-attention. The result has the same shape as S. According to the figure, there is also a split operation, but the paper does not specify how it is implemented.
  - The second, SSE, follows the self-attention paradigm. S and F are concatenated along the channel dimension, linear transformations produce the Query, Key and Value, and these are fed into self-attention. The result is split back according to the original channel counts.
- Static structure: A 1x1 convolution directly replaces the attention similarity computation of the SSE module. The convolution output corresponds to the pre-split result in SSE. Because a 1x1 convolution directly mixes information from different input channels, it implements a process similar to SSE.
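The three interaction structures above can be sketched in NumPy as follows. This is a simplified reading of the description, not the paper's implementation: the projections are reduced to plain channel-mixing matrices (the 3x3 depthwise convolution is omitted), the scaling by the square root of N is an assumption, and all names and sizes are hypothetical. What the sketch does show is that the attention map is computed between channels, with shape (K, C) for CSE and (K+C, K+C) for SSE.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(Q, Kmat):
    """Similarity between channel tokens; each row is a distribution over key channels."""
    return softmax(Q @ Kmat.T / np.sqrt(Q.shape[1]), axis=-1)

def cse(S, F, Wq, Wk, Wv):
    """Cross-attention over channels: S (K, N) gives queries, F (C, N) gives keys and values."""
    attn = channel_attention(Wq @ S, Wk @ F)          # (K, C)
    return attn @ (Wv @ F)                            # (K, N), same shape as S

def sse(S, F, Wq, Wk, Wv):
    """Self-attention over the concatenated channels, then split back into S and F parts."""
    X = np.concatenate([S, F], axis=0)                # (K + C, N)
    out = channel_attention(Wq @ X, Wk @ X) @ (Wv @ X)
    return out[: S.shape[0]], out[S.shape[0]:]

def static_exchange(S, F, W):
    """Static variant: a fixed (K + C, K + C) channel-mixing matrix, i.e. a 1x1 convolution."""
    out = W @ np.concatenate([S, F], axis=0)
    return out[: S.shape[0]], out[S.shape[0]:]

rng = np.random.default_rng(0)
K, C, N = 3, 5, 16                                    # hypothetical sizes
S, F = rng.standard_normal((K, N)), rng.standard_normal((C, N))
S_cse = cse(S, F, rng.standard_normal((K, K)),
            rng.standard_normal((C, C)), rng.standard_normal((C, C)))
mats = [rng.standard_normal((K + C, K + C)) for _ in range(3)]
S_sse, F_sse = sse(S, F, *mats)
S_sta, F_sta = static_exchange(S, F, mats[0])
```

In the static variant the mixing weights no longer depend on the input, which is the only difference from SSE in this sketch.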
In all of these structures, S and F are passed through projection layers before the attention operation. Although the attention is computed over channels, the projection layers take the form 1x1 convolution + 3x3 depthwise convolution + 1x1 convolution, which is still an operation shared across spatial positions.
Each of the two outputs of these modules is followed by its own FFN. The FFN here uses an FC + 3x3 grouped convolution + FC structure, which both refines local features and can be regarded as an implicit positional encoding.
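A minimal NumPy sketch of such an FFN is given below, under several assumptions: the grouped convolution is taken to be depthwise (one group per channel), zero padding is used, and a ReLU stands in for the activation, which the text does not specify. The function names are hypothetical.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """x: (C, H, W); kernels: (C, 3, 3). Zero padding, stride 1, one filter per channel."""
    _, H, W = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + H, dx:dx + W]
    return out

def ffn(x, w1, kernels, w2):
    """FC -> 3x3 depthwise conv -> FC, with the FC layers applied at every position."""
    hidden = np.einsum("dc,chw->dhw", w1, x)          # expand channels C -> D
    hidden = depthwise_conv3x3(hidden, kernels)       # mix each channel with its 3x3 neighborhood
    hidden = np.maximum(hidden, 0.0)                  # assumed activation (ReLU stand-in)
    return np.einsum("cd,dhw->chw", w2, hidden)       # project back D -> C

rng = np.random.default_rng(0)
C, D, H, W = 4, 8, 5, 5                               # hypothetical sizes
x = rng.standard_normal((C, H, W))
y = ffn(x, rng.standard_normal((D, C)), rng.standard_normal((D, 3, 3)),
        rng.standard_normal((C, D)))

# Zero padding makes the output depend on position even for a constant input.
border_effect = depthwise_conv3x3(np.ones((1, 3, 3)), np.ones((1, 3, 3)))[0]
print(border_effect)  # corners 4, edges 6, center 9
```

The last two lines illustrate one common explanation for the "implicit positional encoding" remark: with zero padding, a constant input produces different responses at corners, edges and the interior, so the convolution leaks location information into the features.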