This paper proposes an attention-based module, BVR, which can fuse three object representations (bounding boxes, center points, and corner points) and be seamlessly embedded into various object detection algorithms. It is well worth a read.
Source: Xiaofei's Algorithm Engineering Notes (WeChat official account)
The paper: RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder

- Paper link: https://arxiv.org/abs/2010.15831
- Code: https://github.com/microsoft/RelationNet2
Introduction
Object detection algorithms use many kinds of object representations, as shown in figure b: some are based on rectangular boxes and others on key points. Different representations make a detector perform better in different respects. For example, bounding boxes align better with the annotations, center points are more helpful for small-object recognition, and corner points allow finer localization. This paper investigates whether multiple representations can be integrated into a single framework, and finally proposes an attention-based decoding module, BVR (Bridging Visual Representations). The module is similar to the attention mechanism of the Transformer: it enhances the feature of the current object with a weighted combination of the features of other objects, and can thus fuse heterogeneous features from different representations.
Take an anchor-based method with BVR embedded as an example, as shown in figure a. The anchor representation serves as the $query$, while the center points and corner points of the other representations serve as the $key$s. Correlation weights between the $query$ and the $key$s are computed, and the $key$ features are aggregated by these weights to enhance the $query$ feature. For the object detection scenario, the paper accelerates the weight computation with key sampling and shared location embedding, which reduce the number of $key$s and the cost of computing the weights. Besides anchor-based methods, BVR can also be embedded into other forms of object detection algorithms.
The contributions of this paper are as follows:

- It proposes a general module, BVR, which fuses heterogeneous features of different object representations and can be embedded in-place into various detection frameworks without disrupting the original detection pipeline.
- It proposes two acceleration techniques for the BVR module: key sampling and shared location embedding.
- Experiments show significant improvements on four detectors: RetinaNet, Faster R-CNN, FCOS, and ATSS.
Bridging Visual Representations
Detectors built on different representations have different pipelines, as shown in Figure 2. BVR keeps the representation of the original algorithm as the main feature and adds other representations as auxiliary features. Taking the main feature ($query$) and the auxiliary features ($key$) as input, the attention module enhances the main feature by weighting the auxiliary features according to their relevance:

$$f^{'q}_i = f^q_i + \sum_j S(f^q_i, f^k_j, g^q_i, g^k_j)\cdot T_v(f^k_j) \qquad (1)$$
where $f^q_i$, $f^{'q}_i$, and $g^q_i$ are the input feature, output feature, and geometric vector of the $i$-th $query$ instance, $f^k_j$ and $g^k_j$ are the input feature and geometric vector of the $j$-th $key$ instance, $T_v(\cdot)$ is a linear transformation, and $S(\cdot)$ computes the correlation between instances $i$ and $j$:

$$S(f^q_i, f^k_j, g^q_i, g^k_j) = \mathrm{softmax}_j\big(S^A(f^q_i, f^k_j) + S^G(g^q_i, g^k_j)\big) \qquad (2)$$
$S^A(f^q_i, f^k_j)$ is the appearance similarity, computed as a scaled dot product. $S^G(g^q_i, g^k_j)$ is the geometric term: the relative geometric vector is first encoded with a cosine/sine location embedding and then passed through a two-layer MLP to compute the relevance. Since the geometric vectors of different representations differ (4-d boxes versus 2-d points), the corresponding 2-d point (center or corner) is extracted from the 4-d box so that the geometric vectors of the two representations are aligned.
In the implementation, the BVR module adopts a multi-head attention mechanism with 8 heads by default; that is, the summation term after the '+' sign in Equation 1 is replaced by a concatenation of multiple attended features, each with 1/8 of the input feature dimension.
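To make the computation concrete, below is a minimal PyTorch sketch of the BVR attention core (Equations 1 and 2), combining the scaled-dot-product appearance term with the cosine/sine-embedded geometric term. Names such as `BVRAttention`, `d_embed`, and `d_hidden` are illustrative assumptions, not the authors' implementation:

```python
# A minimal sketch of the BVR attention core (Eq. 1 and 2), assuming PyTorch.
import math
import torch
import torch.nn as nn

class BVRAttention(nn.Module):
    def __init__(self, d_model=256, num_heads=8, d_embed=256, d_hidden=256):
        super().__init__()
        self.h = num_heads
        self.d_head = d_model // num_heads
        self.d_embed = d_embed
        # Linear projections for the appearance term S^A and the values T_v.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Two-layer MLP on the cosine/sine embedding for the geometric term
        # S^G, producing one score per attention head.
        self.geo_mlp = nn.Sequential(
            nn.Linear(d_embed, d_hidden), nn.ReLU(inplace=True),
            nn.Linear(d_hidden, num_heads))

    def cos_sin_embed(self, rel_pos):
        # rel_pos: (N, K, 2) relative (dx, dy); sinusoidal location embedding.
        half = self.d_embed // 4
        freq = torch.exp(torch.arange(half, dtype=torch.float32,
                                      device=rel_pos.device)
                         * (-math.log(1000.0) / half))
        x = rel_pos.unsqueeze(-1) * freq           # (N, K, 2, half)
        emb = torch.cat([x.sin(), x.cos()], -1)    # (N, K, 2, 2*half)
        return emb.flatten(-2)                     # (N, K, d_embed)

    def forward(self, f_q, f_k, g_q, g_k):
        # f_q: (N, d_model) query features; f_k: (N, K, d_model) key features
        # g_q: (N, 2) query points;        g_k: (N, K, 2) key points
        N, K, _ = f_k.shape
        q = self.q_proj(f_q).view(N, self.h, self.d_head)
        k = self.k_proj(f_k).view(N, K, self.h, self.d_head)
        v = self.v_proj(f_k).view(N, K, self.h, self.d_head)
        # Appearance term: scaled dot product per head -> (N, h, K)
        s_a = torch.einsum('nhd,nkhd->nhk', q, k) / math.sqrt(self.d_head)
        # Geometric term: embed relative position, then 2-layer MLP -> (N, h, K)
        s_g = self.geo_mlp(self.cos_sin_embed(g_k - g_q.unsqueeze(1)))
        s_g = s_g.permute(0, 2, 1)
        attn = (s_a + s_g).softmax(dim=-1)         # Eq. 2
        out = torch.einsum('nhk,nkhd->nhd', attn, v).reshape(N, -1)
        return f_q + out                           # Eq. 1, heads concatenated
```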
BVR for RetinaNet
Take RetinaNet as an example: it places 9 anchors at each position of the feature map, giving $9\times H\times W$ boxes in total. The BVR module takes the $C\times 9\times H\times W$ feature map as input ($C$ is the feature dimension) and produces enhanced features of the same size. As shown in figure a, BVR uses center points and corner points as auxiliary $key$ features; the key points are predicted by a lightweight point head network, after which a small number of points are selected and fed into the attention module to enhance the classification and regression features.
Auxiliary (key) representation learning
The point head network consists of two shared $3\times 3$ convolution layers followed by two independent subnets ($3\times 3$ convolution + sigmoid), which predict, for each position of the feature map, the probability that it is a center point (or corner point) together with the corresponding offsets. If the network contains an FPN, the center and corner points of all GT boxes are assigned to every level for training, rather than assigning levels by GT size; this yields more positive samples and speeds up training.
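A minimal sketch of such a point head in PyTorch is given below; the class name `PointHead` and the channel width are illustrative assumptions. Each subnet outputs one sigmoid score channel plus two offset channels per position:

```python
# A sketch of the lightweight point head described above, assuming PyTorch.
import torch.nn as nn

class PointHead(nn.Module):
    """Two shared 3x3 convs, then independent subnets for centers and corners.

    Each subnet predicts, per feature-map position, a keypoint probability
    (1 channel, sigmoid) and a 2-d offset toward the exact point location.
    """
    def __init__(self, in_ch=256):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True))
        # 3 output channels: 1 score + 2 offsets (dx, dy).
        self.center_subnet = nn.Conv2d(in_ch, 3, 3, padding=1)
        self.corner_subnet = nn.Conv2d(in_ch, 3, 3, padding=1)

    def forward(self, feat):
        x = self.shared(feat)
        center = self.center_subnet(x)
        corner = self.corner_subnet(x)
        # Sigmoid on the score channel only; offsets stay unconstrained.
        return ((center[:, :1].sigmoid(), center[:, 1:]),
                (corner[:, :1].sigmoid(), corner[:, 1:]))
```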
Key selection
Since the BVR module uses corners and centers as auxiliary representations, every position of the feature map outputs a probability of being a key point. If every position were treated as a candidate corner or center, the resulting $key$ set would be huge and expensive to compute over; in addition, too many background candidates would suppress the real corners and centers. To solve this, the paper proposes a top-k (50 by default) $key$ selection strategy. Taking corner selection as an example, a stride-1 $3\times 3$ MaxPool is applied to the corner score map, and the top-k scoring positions are selected for subsequent computation. For networks with an FPN, the top-k positions from all levels are selected together and fed into the BVR module without separating levels.
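The selection step might look like the following sketch (assuming PyTorch; `select_keys` is a hypothetical helper): stride-1 $3\times 3$ max pooling keeps only local peaks of the score map, and the top-k surviving scores are taken. With an FPN, the per-level results would simply be concatenated before entering the BVR module:

```python
# A sketch of top-k key selection under the assumptions above.
import torch
import torch.nn.functional as F

def select_keys(score, k=50):
    """score: (H, W) corner/center score map -> top-k peak scores and (x, y)."""
    # Stride-1 3x3 max pooling preserves the map size; a position survives
    # only if it equals the maximum of its 3x3 neighborhood (a local peak).
    pooled = F.max_pool2d(score[None, None], 3, stride=1, padding=1)[0, 0]
    peaks = torch.where(score == pooled, score, torch.zeros_like(score))
    topk_scores, topk_idx = peaks.flatten().topk(k)
    W = score.shape[1]
    ys = torch.div(topk_idx, W, rounding_mode='floor')
    xs = topk_idx % W
    return topk_scores, torch.stack([xs, ys], dim=-1)  # (k,), (k, 2)
```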
Shared relative location embedding
For each pair of $query$ and $key$, the geometric term in Equation 2 requires a cosine/sine embedding of the relative position followed by an MLP to compute the relevance. The time and memory complexity of the geometric term are $\mathcal{O}(time)=(d_0+d_0 d_1+d_1 G)KHW$ and $\mathcal{O}(memory)=(2+d_0+d_1+G)KHW$, where $d_0$, $d_1$, $G$, and $K$ are the cosine/sine embedding dimension, the hidden dimension of the MLP, the number of multi-head attention heads, and the number of selected $key$s, respectively. Both the computation and the memory consumption are very large.
Since the relative positions of the geometric vectors lie in a limited range, generally within $[-H+1, H-1]\times [-W+1, W-1]$, the embedding can be computed in advance for every possible value, producing a $G$-dimensional geometric map from which the value for each $key/query$ pair is obtained by bilinear sampling. To further reduce computation, each position of the geometric map is set to represent $U=\frac{1}{2}S$ pixels of the original image, where $S$ is the stride of the FPN level, so a $400\times 400$ geometric map covers $[-100S, 100S)\times [-100S, 100S)$ of the original image. The time and memory complexity are thereby reduced to $\mathcal{O}(time)=(d_0+d_0 d_1+d_1 G)\cdot 400^2+GKHW$ and $\mathcal{O}(memory)=(2+d_0+d_1+G)\cdot 400^2+GKHW$.
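A sketch of the shared table is shown below, assuming PyTorch and reusing the `geo_mlp`/`cos_sin_embed` pieces from the `BVRAttention` sketch above; the table size, the unit $U=S/2$, and the function names are illustrative assumptions:

```python
# A sketch of shared relative location embedding with bilinear sampling.
import torch
import torch.nn.functional as F

def build_geo_table(bvr, size=400):
    """Precompute geometric scores for every quantized relative offset.

    Each cell stands for U = S/2 input pixels (S = FPN stride), so a
    400x400 table covers offsets in [-100S, 100S) x [-100S, 100S).
    Returns a (G, size, size) table, G = number of attention heads.
    """
    coords = torch.arange(size, dtype=torch.float32) - size // 2  # units of U
    dy, dx = torch.meshgrid(coords, coords, indexing='ij')
    rel = torch.stack([dx, dy], dim=-1).view(1, size * size, 2)
    table = bvr.geo_mlp(bvr.cos_sin_embed(rel)).view(size, size, -1)
    return table.permute(2, 0, 1).contiguous()

def sample_geo(table, rel_pos, unit):
    """Bilinearly sample per-pair scores; rel_pos: (N, K, 2) in input pixels."""
    size = table.shape[-1]
    grid = (rel_pos / unit) / (size / 2)        # map +-(size/2)*U to [-1, 1]
    out = F.grid_sample(table[None], grid[None], align_corners=False)
    return out[0].permute(1, 2, 0)              # (N, K, G) geometric scores
```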
Separate BVR modules for classification and regression
The center-point representation provides rich category information, while the corner representation promotes localization accuracy. The paper therefore uses separate BVR modules to enhance the classification and regression features, as shown in figure a: center points are used to enhance the classification features, and corner points to enhance the regression features.
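As a hypothetical usage of the `BVRAttention` sketch above (random tensors, illustrative shapes only), one module is created per branch, with centers feeding the classification branch and corners feeding the regression branch:

```python
# Illustrative wiring of the two separate BVR modules (assumed shapes).
import torch

cls_bvr, reg_bvr = BVRAttention(), BVRAttention()   # separate modules
N, K = 1024, 50                                     # queries, selected keys
f_cls, f_reg = torch.randn(N, 256), torch.randn(N, 256)  # branch features
g_q = torch.rand(N, 2) * 800                        # query point locations
f_ctr, g_ctr = torch.randn(N, K, 256), torch.rand(N, K, 2) * 800  # centers
f_cor, g_cor = torch.randn(N, K, 256), torch.rand(N, K, 2) * 800  # corners
f_cls = cls_bvr(f_cls, f_ctr, g_q, g_ctr)  # centers enhance classification
f_reg = reg_bvr(f_reg, f_cor, g_q, g_cor)  # corners enhance regression
```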
BVR for Other Frameworks
The paper also embeds the BVR module into ATSS, FCOS, and Faster R-CNN. ATSS is handled in the same way as RetinaNet, and FCOS is also similar to RetinaNet, except that the center point is taken as the $query$ representation. The embedding for Faster R-CNN is shown in Figure 4: it uses the features after RoI Align, and the rest is similar.
Experiment
The paper conducts extensive comparative experiments; refer to the original text for the detailed experimental setups and key conclusions.
Conclusion
This paper proposes an attention-based module, BVR, which can fuse three object representations (bounding boxes, center points, and corner points) and be seamlessly embedded into various object detection algorithms. It is well worth a read.
If this article helps you, please give it a like or tap "Wow" ~
For more content, follow the WeChat official account 【Xiaofei's Algorithm Engineering Notes】