RepPoints: Microsoft cleverly uses deformable convolution to generate point sets for object detection, full of creativity | ICCV 2019

2022-06-24 07:39:00 【VincentLee】

RepPoints is an ingenious design: it represents objects with point sets rich in semantic information and implements this neatly with deformable convolution. The overall network design is thorough and well worth studying.

Source: Xiaofei's Algorithm Engineering Notes (WeChat official account)

Paper: RepPoints: Point Set Representation for Object Detection

Paper address: https://arxiv.org/abs/1904.11490
Paper code: https://github.com/microsoft/RepPoints

Introduction

The classical bounding box is convenient to compute, but it ignores the shape and pose of the object, and features extracted from the rectangular region can be heavily contaminated by background content or other objects. Such low-quality features in turn hurt detection performance. To address these problems, the paper proposes RepPoints, a new object representation that offers finer-grained localization as well as better classification.

As shown in Figure 1, RepPoints is a set of points that adaptively surrounds the object and captures the semantic features of local regions. Training is driven jointly by object localization and object classification, which constrains the points to enclose the object tightly and guides the detector to classify it correctly. This adaptive representation is differentiable, can be used consistently across multiple stages of the detector, and does not require anchors to generate a large number of initial boxes.

The RepPoints Representation

As noted above, the bounding box is only a coarse-grained representation of object location. It considers only the rectangular extent of the object and ignores its shape, pose, and semantically rich local regions, which could help the network localize better and extract better features. To overcome these shortcomings, RepPoints represents the object with a set of adaptive sample points:

$\mathcal{R}=\{(x_k, y_k)\}_{k=1}^{n}$

where $n$ is the total number of sample points for the object, set to 9 by default.

Progressively refining the bounding box location and the extracted features is key to the success of multi-stage detectors. For RepPoints, the refinement can be written simply as:

$\mathcal{R}_r=\{(x_k+\Delta x_k, y_k+\Delta y_k)\}_{k=1}^{n}$

where $\{(\Delta x_k, \Delta y_k)\}_{k=1}^{n}$ are the predicted offsets of the new sample points relative to the old ones. All point offsets share the same scale, so unlike bounding box regression there is no scale mismatch between center coordinates and box side lengths to deal with.
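The refinement step amounts to adding one predicted offset to each point. A minimal sketch in plain Python, assuming points and offsets are lists of (x, y) tuples; the name `refine_points` and the sample offsets are hypothetical and not taken from the paper's code:

```python
from typing import List, Tuple

Point = Tuple[float, float]


def refine_points(points: List[Point], offsets: List[Point]) -> List[Point]:
    """Return the refined point set: each point shifted by its own predicted offset."""
    if len(points) != len(offsets):
        raise ValueError("points and offsets must have the same length")
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(points, offsets)]


# All n = 9 points start at the object's center point.
center = (50.0, 40.0)
initial = [center] * 9

# Hypothetical predicted offsets that spread the points on a 3 x 3 grid.
sample_offsets = [(dx, dy) for dy in (-10.0, 0.0, 10.0) for dx in (-20.0, 0.0, 20.0)]
first_set = refine_points(initial, sample_offsets)
```

Because every coordinate is updated by a plain addition, the same function can be applied again to move from one set of points to the next.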

Converting RepPoints to bounding box

To use bounding box annotations for training, and to evaluate RepPoints-based detectors, a predefined conversion function $\mathcal{T}: \mathcal{R}_P\to \mathcal{B}_P$ turns RepPoints into a pseudo box. Three conversion functions are considered:

$\mathcal{T}=\mathcal{T}_1$: Min-max function. A min-max operation over all RepPoints yields the pseudo box $\mathcal{B}_P$.
$\mathcal{T}=\mathcal{T}_2$: Partial min-max function. A min-max operation over a subset of the RepPoints yields the pseudo box $\mathcal{B}_P$.
$\mathcal{T}=\mathcal{T}_3$: Moment-based function. The mean and standard deviation of the RepPoints give the center and size of the pseudo box $\mathcal{B}_P$; the size is obtained by multiplying the standard deviation by the globally shared learnable parameters $\lambda_x$ and $\lambda_y$.

All of these functions are differentiable, so they can be plugged into the detector for end-to-end training. Experiments show that all three conversion functions work well.
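The three conversion functions can be sketched in plain Python as follows. This is an illustration under assumptions rather than the authors' implementation: the function names are hypothetical, the subset used by the partial min-max variant is assumed to be the first `k` points, the scaled standard deviation is treated as the half-size of the box, and `lambda_x` and `lambda_y` are fixed placeholders for parameters that would be learned in practice.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def minmax_box(points: Sequence[Point]) -> Box:
    """T1: tightest axis-aligned box around all points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def partial_minmax_box(points: Sequence[Point], k: int = 4) -> Box:
    """T2: min-max over a subset of the points (assumption: the first k)."""
    return minmax_box(points[:k])


def moment_box(points: Sequence[Point],
               lambda_x: float = 1.0, lambda_y: float = 1.0) -> Box:
    """T3: center from the mean, half-size from lambda times the standard deviation."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    std_x = math.sqrt(sum((p[0] - mean_x) ** 2 for p in points) / n)
    std_y = math.sqrt(sum((p[1] - mean_y) ** 2 for p in points) / n)
    half_w = lambda_x * std_x  # assumption: scaled std is the half-width
    half_h = lambda_y * std_y  # assumption: scaled std is the half-height
    return (mean_x - half_w, mean_y - half_h, mean_x + half_w, mean_y + half_h)


# Four points on the corners of a 4 x 2 rectangle.
corners = [(0.0, 0.0), (4.0, 0.0), (0.0, 2.0), (4.0, 2.0)]
box_t1 = minmax_box(corners)
box_t3 = moment_box(corners)
```

The min-max variants depend only on the extreme points, while the moment-based variant uses every point, so a single outlying point shifts the box less abruptly.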

RPDet: an Anchor Free Detector

Based on RepPoints, the paper designs RPDet, an anchor-free object detector with two recognition stages. Because deformable convolution samples a set of irregularly distributed points to compute its output, it suits RepPoints very well, and the sample points can be steered by feedback from the recognition result.

Center point based initial object representation

RPDet takes the center point as the initial object representation and refines it step by step into the final RepPoints; the center point can itself be seen as a special case of RepPoints. When two objects fall on the same location of the feature map, center-point-based methods usually face ambiguity about which object to recognize. Earlier methods resolve this by placing multiple preset anchors at each location, whereas RPDet relies on FPN:

Objects of different sizes are handled by features from different levels.
Small objects are assigned to levels whose feature maps are generally larger, which lowers the chance that two objects share the same location.

The paper's statistics show that, with the FPN constraint above, only 1.1% of objects in COCO are affected by this problem.

Utilization of RepPoints

As shown in Figure 2, RepPoints is the basic object representation in RPDet. Starting from the center point, the first set of RepPoints is obtained by regressing offsets from the center point. The second set of RepPoints represents the final object location and is obtained by refining the first set. Learning of RepPoints is driven by two objectives:

The distance loss between the top-left and bottom-right corners of the pseudo box and those of the ground-truth (GT) box
The subsequent object classification loss

The first set of RepPoints is guided by both the distance loss and the classification loss. The second set is guided by the distance loss only, with the main aim of learning more accurate localization.
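The corner distance loss can be illustrated with a short sketch. The text above does not say which distance is used; a smooth L1 distance is assumed here, and `smooth_l1`, `corner_distance_loss`, and `beta` are hypothetical names chosen for this example.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def smooth_l1(diff: float, beta: float = 1.0) -> float:
    """Quadratic for small differences, linear for large ones."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def corner_distance_loss(pseudo_box: Box, gt_box: Box, beta: float = 1.0) -> float:
    """Sum of smooth L1 distances over the top-left (x1, y1) and
    bottom-right (x2, y2) corner coordinates of the two boxes."""
    return sum(smooth_l1(p - g, beta) for p, g in zip(pseudo_box, gt_box))


gt = (0.0, 0.0, 10.0, 10.0)
loss = corner_distance_loss((0.5, 0.0, 12.0, 10.0), gt)
```

Since the pseudo box is a differentiable function of the points, a loss of this form on the box corners can pass gradients back to every point that contributes to the box.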

Backbone and head architectures

The FPN backbone provides 5 feature pyramid levels, from stage 3 (8x downsampling) to stage 7 (128x downsampling). The head structure is shown in Figure 3. The head is shared across levels and contains two independent subnets, responsible for localization (RepPoints generation) and classification respectively:

The localization subnet first applies three 256-d $3\times 3$ convolutions for feature extraction, each followed by a group normalization layer, and then two small networks compute the offsets of the two sets of RepPoints.
The classification subnet first applies three 256-d $3\times 3$ convolutions for feature extraction, each followed by a group normalization layer. The offsets of the first set of RepPoints output by the localization subnet are then fed into a 256-d $3\times 3$ deformable convolution for further feature extraction, and finally the classification result is output.

Although RPDet uses two-stage localization, it is even more efficient than the one-stage RetinaNet. The main reason is that the anchor-free design reduces the computation of the classification layer, which more than offsets the small cost of the extra localization stage.
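One way to feed 9 RepPoints into a $3\times 3$ deformable convolution is sketched below. It rests on assumptions not stated in the text: that a deformable convolution samples at center + regular grid position + learned offset, that the point offsets are given relative to the center in feature-map cells, and that offsets are ordered as (dy, dx). The names `BASE_GRID` and `to_deform_offsets` are hypothetical.

```python
from typing import List, Tuple

Offset = Tuple[float, float]  # (dy, dx) in feature-map cells

# Regular 3 x 3 sampling grid of an ordinary convolution, row by row.
BASE_GRID: List[Offset] = [
    (float(dy), float(dx)) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
]


def to_deform_offsets(rep_offsets: List[Offset]) -> List[Offset]:
    """Convert 9 center-relative point offsets into deformable-conv offsets.

    The deformable conv samples at center + grid + offset, so subtracting the
    grid position makes it sample exactly at center + rep_offset.
    """
    if len(rep_offsets) != len(BASE_GRID):
        raise ValueError("expected one offset per kernel position (9 for 3 x 3)")
    return [(ry - gy, rx - gx) for (ry, rx), (gy, gx) in zip(rep_offsets, BASE_GRID)]


# Points that already sit on the regular grid need no extra offset.
no_shift = to_deform_offsets(BASE_GRID)

# Points all collapsed onto the center pull every kernel tap back to the center.
collapsed = to_deform_offsets([(0.0, 0.0)] * 9)
```

With this mapping, the 9 kernel weights of the deformable convolution are applied to features sampled at the 9 points, which is why the default $n=9$ pairs naturally with a $3\times 3$ kernel.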

Localization/class target assignment

Localization consists of two stages: the first derives the first set of RepPoints from the center point, and the second refines the first set into the second set. Positive samples are defined differently in each stage:

In the first stage, a feature point counts as a positive sample if: 1) the pyramid level of the feature point equals $s(B)=\lfloor \log_2 (\sqrt{w_Bh_B}/4)\rfloor$, where $w_B$ and $h_B$ are the width and height of the GT box $B$; 2) the center point of the object, projected onto the feature map, falls on that feature point.
In the second stage, a feature point counts as a positive sample only if the pseudo box generated for it in the first stage has an IoU greater than 0.5 with the object. This resembles current anchor-based methods, with the first-stage output playing the role of the anchor.

Classification considers only the first set of RepPoints. A feature point is therefore a positive sample if the pseudo box generated from its first set of RepPoints has an IoU greater than 0.5 with the object, background if the IoU is less than 0.4, and ignored otherwise.
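The first-stage rule can be made concrete with a small sketch. The level formula follows the expression above; clamping to levels 3 to 7, taking the stride of level $l$ as $2^l$, and flooring the projected center are assumptions of this example, and the function names are hypothetical.

```python
import math
from typing import Tuple


def assign_level(width: float, height: float,
                 min_level: int = 3, max_level: int = 7) -> int:
    """s(B) = floor(log2(sqrt(w * h) / 4)), clamped to the available levels."""
    level = math.floor(math.log2(math.sqrt(width * height) / 4))
    return max(min_level, min(max_level, level))


def center_cell(cx: float, cy: float, level: int) -> Tuple[int, int]:
    """Feature-map cell (column, row) that contains the object center."""
    stride = 2 ** level  # level 3 -> stride 8, ..., level 7 -> stride 128
    return (int(cx // stride), int(cy // stride))


# A 64 x 64 object centered at (100, 60) in image coordinates.
level = assign_level(64.0, 64.0)
cell = center_cell(100.0, 60.0, level)
```

Under these assumptions the feature point at `cell` on pyramid level `level` is the single first-stage positive sample for this object.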
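The IoU thresholds for classification can be sketched as follows. The helper names `iou` and `classification_label` are hypothetical, and taking the best IoU over all GT boxes in the image is an assumption for the case of several objects.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def classification_label(pseudo_box: Box, gt_boxes: Sequence[Box],
                         pos_thr: float = 0.5, neg_thr: float = 0.4) -> str:
    """Label a feature point from the IoU of its first-stage pseudo box."""
    best = max((iou(pseudo_box, gt) for gt in gt_boxes), default=0.0)
    if best > pos_thr:
        return "positive"
    if best < neg_thr:
        return "background"
    return "ignore"  # between the two thresholds: excluded from the loss


gt_boxes = [(0.0, 0.0, 10.0, 10.0)]
labels = [
    classification_label((2.0, 0.0, 12.0, 10.0), gt_boxes),
    classification_label((4.0, 0.0, 14.0, 10.0), gt_boxes),
    classification_label((5.0, 0.0, 15.0, 10.0), gt_boxes),
]
```

The band between 0.4 and 0.5 keeps borderline pseudo boxes from being pushed toward either class during training.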

Experiments

Performance comparison of the different pseudo box conversion functions.

Performance comparison with other state-of-the-art (SOTA) detection methods.

Conclusion

RepPoints is an ingenious design: it represents objects with point sets rich in semantic information and implements this neatly with deformable convolution. The overall network design is thorough and well worth studying.

