
Introduction to anchor-free detection

2022-06-26 00:09:00 【Invincible Zhang Dadao】

1. Background

Object detection has moved from the two-stage era to the one-stage era, and from anchor-based to anchor-free methods, becoming more and more refined. Starting with CornerNet in 2018, anchor-free papers exploded in number, marking the beginning of the anchor-free era.

2. Network

2.1 DenseBox

Contributions of the paper:

It shows that a simple FCN, as long as the network is designed reasonably, can detect targets at different scales and under severe occlusion.
It proposes a new FCN model, DenseBox, which requires no region proposals and can be trained end to end.
Combining it with landmark localization through multi-task learning further improves DenseBox's accuracy.

Network architecture: with VGGNet as the backbone, the pipeline is input image (m * n * 3) -> conv -> bilinear upsample -> output (m/4 * n/4 * 5) -> threshold and NMS.
1) First, it designs an end-to-end multi-task fully convolutional model that directly regresses the confidence that an object is present and the object's relative position.
2) To better handle heavily occluded objects and improve the recall of small objects, it adds an upsampling layer to the detection network and fuses in the features from shallow layers, producing a larger output layer.
3) To filter training samples, keep positive and negative samples balanced, and reduce false detections, it was also an early adopter of online hard negative mining, that is, mining hard examples.
4) Each pixel is converted to a confidence score plus four distances to the target bounding box (bbox), and NMS is then applied.
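
To make step 4) concrete, here is a minimal pure-Python sketch of decoding per-pixel predictions into boxes and then applying greedy NMS. The function names, the (left, top, right, bottom) distance convention, and the thresholds are assumptions chosen for illustration; they are not taken from the DenseBox code.

```python
def decode_densebox(score_map, dist_map, stride=4, score_thresh=0.5):
    """Turn per-pixel predictions into candidate boxes.

    score_map[y][x] -> confidence at this output pixel
    dist_map[y][x]  -> (left, top, right, bottom) distances in input pixels
                       (assumed convention, for illustration only)
    Returns a list of (score, x1, y1, x2, y2) in input-image coordinates.
    """
    boxes = []
    for y, row in enumerate(score_map):
        for x, score in enumerate(row):
            if score < score_thresh:
                continue  # thresholding step
            left, top, right, bottom = dist_map[y][x]
            cx, cy = x * stride, y * stride  # output pixel -> input pixel
            boxes.append((score, cx - left, cy - top, cx + right, cy + bottom))
    return boxes


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, iou_thresh=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_thresh."""
    kept = []
    for cand in sorted(boxes, key=lambda b: b[0], reverse=True):
        if all(iou(cand[1:], k[1:]) <= iou_thresh for k in kept):
            kept.append(cand)
    return kept
```

The stride of 4 matches the m/4 * n/4 output size mentioned above; the five output channels correspond to the score plus the four distances.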

Adding a few layers to the FCN structure enables landmark localization, and fusing the landmark heatmaps with the score map further improves the detection results.

2.2 YOLOv1

YOLOv1 is a representative anchor-free work (YOLOv2 and v3 are both network architectures with anchor boxes).

The input is a 448x448x3 color image. The first 20 GoogLeNet-style convolutional layers output a 14x14x1024 feature map.
It then passes through four more convolutional layers and 2 fully connected layers, and is finally reshaped into a 7x7x30 matrix (tensor).

As shown in the figure above, after the layers of convolution, the 7x7x30 output matrix is equivalent to dividing the original image into 7x7=49 cells, each described by a 30-dimensional vector. The first 10 values predict two bboxes (bounding-box regression boxes), one horizontal and one vertical, with 5 values per box. The remaining 20 values are the class part: for each of the 20 classes, the probability that the target belongs to that class. The 5 values of each box are:
(1) the x coordinate of the bbox center relative to the grid cell (the small red box in the figure)
(2) the y coordinate of the bbox center relative to the grid cell
(3) the bbox width
(4) the bbox height
(5) whether there is a target: 1 x IoU if a target exists, otherwise 0 x IoU
The predicted x, y, w, and h are all normalized to the open interval (0, 1); the conversion is shown in the figure below.
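
As a rough sketch of that conversion: in the YOLOv1 formulation the center is expressed as an offset inside its grid cell, while width and height are divided by the image size. The function names and argument layout below are hypothetical and only meant to illustrate the arithmetic.

```python
def encode_box(cx, cy, w, h, img_w=448, img_h=448, S=7):
    """Map an absolute box (center, width, height in pixels) to
    normalized targets plus the grid cell that contains the center."""
    cell_w, cell_h = img_w / S, img_h / S
    col = min(int(cx / cell_w), S - 1)  # grid column containing the center
    row = min(int(cy / cell_h), S - 1)  # grid row containing the center
    x = cx / cell_w - col               # center offset inside the cell
    y = cy / cell_h - row
    return row, col, (x, y, w / img_w, h / img_h)


def decode_box(row, col, target, img_w=448, img_h=448, S=7):
    """Inverse of encode_box: recover pixel coordinates."""
    x, y, w, h = target
    cell_w, cell_h = img_w / S, img_h / S
    return ((col + x) * cell_w, (row + y) * cell_h, w * img_w, h * img_h)
```

For example, a box centered at (224, 100) with size 112 x 224 falls in row 1, column 3 of the 7x7 grid (each cell is 64 pixels wide) and is encoded as (0.5, 0.5625, 0.25, 0.5).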

Each of the 49 cells in the 7x7 grid always has its own 30-dimensional vector.
Each cell is responsible for a single target. If the centers of two or more targets fall in the same cell, only the category with the highest probability is kept for that cell.
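
The layout of one cell's vector can be sketched as follows. The ordering (two 5-value boxes first, then 20 class probabilities) follows the description in this post; actual implementations may order the values differently, and `parse_cell` is a hypothetical helper name.

```python
def parse_cell(vec, num_boxes=2, num_classes=20):
    """Split one cell's prediction vector into boxes and class probabilities.

    Assumed layout: [x, y, w, h, conf] * num_boxes, then class probabilities.
    Returns (boxes, best_box, best_class, score).
    """
    if len(vec) != num_boxes * 5 + num_classes:
        raise ValueError("unexpected vector length")
    boxes = [tuple(vec[i * 5:(i + 1) * 5]) for i in range(num_boxes)]
    class_probs = vec[num_boxes * 5:]
    # One class per cell: keep the class with the highest probability.
    best_class = max(range(num_classes), key=lambda c: class_probs[c])
    # Keep the box with the higher confidence (the 5th value).
    best_box = max(boxes, key=lambda b: b[4])
    # Class-specific score = box confidence * class probability.
    score = best_box[4] * class_probs[best_class]
    return boxes, best_box, best_class, score
```

With 2 boxes and 20 classes the vector length is 2 * 5 + 20 = 30, which is where the 7x7x30 output shape comes from.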

Loss calculation

2.3 CornerNet

Overall network architecture: the backbone is an hourglass network, followed by two prediction modules.

In brief:

1.1 Hourglass network
The principle is similar to ResNet: the features downsampled by convolution in the early stages are fused with the later upsampled features, giving feature maps at different scales for the subsequent pooling.
1.2 Corner pooling

Take two feature maps and look at the same position in both. On the first feature map, take the maximum over the column from that position downward (max pooling). On the second feature map, take the maximum over the row from that position to the right (max pooling). Add the two maxima, and the sum is the output at that position. Doing this for every position gives the complete output, which is top-left corner pooling. Similarly, bottom-right corner pooling takes the maximum looking upward and the maximum looking left, and then adds them.
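
Top-left corner pooling as described above can be sketched in plain Python with two running-maximum scans. This is an illustrative version on nested lists, not the CornerNet implementation; the function and argument names are made up for this example.

```python
def top_left_corner_pool(f_down, f_right):
    """Top-left corner pooling on two H x W feature maps (nested lists).

    out[i][j] = max(f_down[i..H-1][j]) + max(f_right[i][j..W-1])
    """
    H, W = len(f_down), len(f_down[0])

    # Running maximum along each column, scanning bottom -> top.
    col_max = [row[:] for row in f_down]
    for i in range(H - 2, -1, -1):
        for j in range(W):
            col_max[i][j] = max(col_max[i][j], col_max[i + 1][j])

    # Running maximum along each row, scanning right -> left.
    row_max = [row[:] for row in f_right]
    for i in range(H):
        for j in range(W - 2, -1, -1):
            row_max[i][j] = max(row_max[i][j], row_max[i][j + 1])

    return [[col_max[i][j] + row_max[i][j] for j in range(W)] for i in range(H)]
```

Bottom-right corner pooling would use the same idea with the scan directions reversed (top to bottom, and left to right).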
1.3 Prediction module
The output of each prediction module has three parts: heatmaps, embeddings, and offsets. Their respective jobs are to locate the corners, pair the corners, and correct the positions.
Heatmap: there are C channels, where C is the number of target categories. There is no background channel. Every channel is a binary mask that indicates the positions of the corners; finding these points is, after all, our ultimate goal.
Embedding: used for corner pairing. You have a pile of top-left corners and another pile of bottom-right corners, so how do you know which should pair with which? This borrows the idea of grouping joints from human pose estimation: each corner is assigned an embedding, which you can think of as an ID card. Each object's ID card has a different color, and the corners holding ID cards of the same color belong to one family. In practice, the top-left corner and bottom-right corner whose embedding values are closest are paired to draw a box.
Offset: why compute this? In the authors' experiments the input is 511*511 (if I remember correctly), but the heatmap is 128*128. A point (x, y) on the input maps to ([x*128/511], [y*128/511]) on the heatmap, where [.] denotes rounding down. Whatever the exact result, the rounding symbol tells you that precision is lost, so when a position found on the heatmap is mapped back to the input it will be off. That is why there are offsets (128*128*2, one offset for each of the x and y directions).
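
The precision loss and its correction can be checked with a few lines of Python. The sizes 511 and 128 are the ones quoted above; the function names are hypothetical.

```python
import math


def corner_to_heatmap(x, y, in_size=511, out_size=128):
    """Map an input-image point to its heatmap cell and the lost fraction."""
    scale = out_size / in_size
    fx, fy = x * scale, y * scale
    hx, hy = math.floor(fx), math.floor(fy)  # rounding down loses precision
    return (hx, hy), (fx - hx, fy - hy)      # the offset is what was lost


def heatmap_to_input(hx, hy, offset=(0.0, 0.0), in_size=511, out_size=128):
    """Map a heatmap cell back to input coordinates, optionally corrected."""
    scale = in_size / out_size
    return ((hx + offset[0]) * scale, (hy + offset[1]) * scale)
```

For the point (100, 300), the heatmap cell is (25, 75). Mapping that cell back without the offset gives roughly (99.8, 299.4), while adding the offset recovers the original point up to floating-point error.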
The specific procedure is: first apply non-maximum suppression to the heatmaps, then take the top 100 top-left corners and the top 100 bottom-right corners, and use the offsets to correct the positions of these corners. Next, compute the L1 distance between the embeddings of each top-left and bottom-right corner. Pairs whose distance is greater than 0.5, or whose corners belong to different classes, do not deserve to walk into the marriage hall hand in hand. The pairs that can are matched, and each matched pair is used to draw a box.
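
The pairing rule can be sketched as below, assuming each corner has already been extracted and offset-corrected. The dictionary layout, the scalar embedding, the geometry check, and the averaged score are assumptions made for this illustration.

```python
def pair_corners(top_lefts, bottom_rights, max_dist=0.5):
    """Pair top-left and bottom-right corners into boxes.

    Each corner is a dict with keys x, y, cls, emb, score (hypothetical layout).
    Returns (score, x1, y1, x2, y2, cls) tuples sorted by descending score.
    """
    boxes = []
    for tl in top_lefts:
        for br in bottom_rights:
            if tl["cls"] != br["cls"]:
                continue  # different classes never pair
            if abs(tl["emb"] - br["emb"]) > max_dist:
                continue  # L1 distance between embeddings too large
            if br["x"] <= tl["x"] or br["y"] <= tl["y"]:
                continue  # bottom-right must lie below and right of top-left
            score = (tl["score"] + br["score"]) / 2  # average corner score
            boxes.append((score, tl["x"], tl["y"], br["x"], br["y"], tl["cls"]))
    return sorted(boxes, reverse=True)
```

With at most 100 corners of each kind, the double loop checks at most 10,000 candidate pairs, so a brute-force comparison is cheap.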
1.4 Loss function

To be continued