[Paper notes] Poly-YOLO: Higher Speed, More Precise Detection and Instance Segmentation for YOLOv3

2022-06-25 15:14:00 【m0_ sixty-one million eight hundred and ninety-nine thousand on】

The paper

Paper title: Poly-YOLO: Higher Speed, More Precise Detection and Instance Segmentation for YOLOv3

Submitted to: IEEE Transactions on Pattern Analysis and Machine Intelligence

Paper URL: https://arxiv.org/pdf/2005.13243.pdf

Source code: IRAFM AI / Poly-YOLO (GitLab)

An improved version of YOLOv3 is here. Compared with YOLOv3, Poly-YOLO has only 60% of the trainable parameters, yet it improves mAP by a relative 40%. The authors also propose a lighter variant, Poly-YOLO Lite.

Background

Object detection is the process of localizing every region of an image that contains an object of interest while ignoring the background. An object is usually delimited by a bounding box, given by the coordinates of its top-left corner plus its width and height. The drawback is that for objects with complex shapes the box does not wrap the object tightly, so background can occupy a large part of it. This can degrade a classifier applied to the box contents, or fall short when precise detection is required. To avoid the problem, classical detectors such as Faster R-CNN and RetinaNet were extended into Mask R-CNN and RetinaMask. These methods also infer instance segmentation, i.e. each pixel inside the bounding box is classified as object or background. Their limitation is computational speed: they cannot run in real time on anything but high-end hardware.

The authors' focus is an accurate detector with instance segmentation that runs in real time on a mid-range graphics card.

Preface

Poly-YOLO instance segmentation example

Object detection models fall into two groups: two-stage and one-stage detectors. A two-stage detector splits the work as follows: the first stage proposes regions of interest (RoIs), and the second stage performs bounding-box regression and classification inside those candidate regions. A one-stage detector predicts the bounding boxes and their classes in a single pass. Two-stage detectors are usually more accurate in localization and classification, but slower than one-stage detectors. Both types consist of a backbone network for feature extraction and a head network for classification and regression. The backbone is usually a state-of-the-art network such as ResNet or ResNeXt, pre-trained on ImageNet or OpenImages. Even so, some methods also experiment with training from scratch.

The framework covered today is a new version of YOLOv3 with better performance, extended with instance segmentation and called Poly-YOLO. Poly-YOLO builds on the original ideas of YOLOv3 and removes two of its weaknesses: label rewriting and an unbalanced distribution of anchors.

Poly-YOLO reduces these problems by aggregating features from an SE-Darknet-53 backbone with the hypercolumn technique, using stairstep upsampling, and produces a single-scale output with high resolution. Compared with YOLOv3, Poly-YOLO has only 60% of the trainable parameters but improves mAP by a relative 40%. Poly-YOLO Lite, with fewer parameters and a lower output resolution, has the same accuracy as YOLOv3 but is three times smaller and twice as fast, which makes it better suited to embedded devices.

These notes focus on how Poly-YOLO solves the two major problems of YOLOv3.

Left: the YOLO grid overlaid on the input image; the yellow dots mark the centers of the detected objects. Right: the detection results.

Problem 1: Label rewriting

Label rewriting stems from the way YOLO makes a single grid cell responsible for predicting a bbox: two objects can be assigned to the same anchor in the same cell, in which case only one of them is kept for prediction and the other is treated as background. The problem gets worse the lower the input resolution, the denser the objects, and the closer their widths and heights. In the figure above, red marks the rewritten bboxes: 10 of the 27 objects were rewritten.

Concretely, take a 416 × 416 image. Successive convolutions reduce it to a 13 × 13 feature map, so one feature-map pixel corresponds to a 32 × 32 image patch. During YOLOv3 training, if the centers of two objects fall into the same cell and both are assigned to the same anchor, the label of the earlier object is overwritten by the later one. In other words, when two centers are so close that they are downsampled onto the same feature-map pixel, one of the objects is rewritten and never contributes to training.
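
The overwrite can be reproduced with a few lines of Python. This is only a minimal sketch of the assignment step, not YOLOv3's actual training code: the names `cell_of` and `assign_labels` and the two example boxes are made up for illustration, and anchor matching is reduced to a given `anchor_id`.

```python
def cell_of(cx, cy, stride):
    """Return the (col, row) grid cell that contains a box center given in pixels."""
    return int(cx // stride), int(cy // stride)


def assign_labels(boxes, stride):
    """Map (cell, anchor_id) -> box center; a later box overwrites an earlier one."""
    grid = {}
    rewritten = 0
    for cx, cy, anchor_id in boxes:
        key = (cell_of(cx, cy, stride), anchor_id)
        if key in grid:
            rewritten += 1  # the earlier label in this slot is lost
        grid[key] = (cx, cy)
    return grid, rewritten


# Two hypothetical objects with centers about 11 px apart, matched to the same anchor.
boxes = [(100.0, 100.0, 0), (110.0, 105.0, 0)]

grid, rewritten = assign_labels(boxes, stride=32)  # 416 / 13 = 32
print(len(grid), rewritten)  # prints: 1 1
```

With `stride=32` both centers land in cell (3, 3), so only the second label survives. Calling the same function with `stride=8` puts them in different cells and nothing is rewritten.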

On COCO the effect is not pronounced because the bboxes are well distributed: objects of different sizes are assigned to different prediction layers, so the probability of label rewriting is relatively low. In many practical applications, however, such as inspecting specific components in industry, the objects are packed tightly and are almost the same size, and label rewriting becomes likely. The paper points out that the effect is also clearly visible on Cityscapes.

Problem 2: Unbalanced anchor distribution

The YOLO series uses k-means clustering to obtain 9 anchors for the dataset at hand and splits them into three groups of three. These serve as the default anchors of the large output map (which detects small objects), the medium output map, and the small output map (which detects large objects). Objects of different sizes are therefore assigned, through their anchors, to different prediction layers.
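
A rough sketch of this anchor computation is given below. It assumes the commonly used 1 - IoU distance on (width, height) pairs and a mean update; the function names, the synthetic boxes, and the seeds are illustrative and not taken from any YOLO code base.

```python
import random


def iou_wh(a, b):
    """IoU of two boxes given as (w, h), both aligned at the same corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs by highest IoU and return k anchors sorted by area."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: iou_wh(box, centers[i]))
            groups[best].append(box)
        new_centers = []
        for i, group in enumerate(groups):
            if not group:
                new_centers.append(centers[i])  # keep the center of an empty cluster
            else:
                mean_w = sum(w for w, _ in group) / len(group)
                mean_h = sum(h for _, h in group) / len(group)
                new_centers.append((mean_w, mean_h))
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])


# Synthetic (w, h) boxes standing in for a real dataset.
rng = random.Random(1)
boxes = [(rng.uniform(8, 400), rng.uniform(8, 400)) for _ in range(300)]

anchors = kmeans_anchors(boxes, k=9)
# Groups of three: smallest anchors for the large output map, largest for the small one.
small, medium, large = anchors[0:3], anchors[3:6], anchors[6:9]
print(small, medium, large)
```

Sorting by area before slicing is what ties each anchor to an output layer, which is exactly the step the following paragraphs question.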

The result of this k-means procedure is problematic, though, and we have run into it in real projects as well. As mentioned above, most detection datasets for a specific scene are unlike COCO's natural scenes, which contain objects at every scale. In a real project most objects are about the same size, or come in just a few specific scales. With k-means, objects of almost the same size are then forced onto different layers for prediction. This is a strange training signal for the network: the widths and heights of the objects may differ only slightly, yet they are predicted at different levels, which is clearly unreasonable. The synthetic data generated by the paper's authors has the same characteristic.

The authors point out that this k-means setup is only reasonable when M ∼ U(0, r), where r is the input resolution, e.g. 416. The formula says that object sizes are uniformly distributed between 0 and r, i.e. bboxes of every size occur in a 416 × 416 image, and in that case k-means is sensible. Most scenes, however, look more like M ∼ N(0.5r, r), a normal distribution with mean 0.5r and variance r. If the anchors are computed with the default k-means strategy, then, because most objects are medium-sized, the other two branches are trained poorly or not at all, and part of the network is wasted.
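
The difference between the two distributions can be made visible with a small simulation. The sketch below rests on two assumptions that are not from the paper: the size range [0, r] is cut into three equal thirds standing in for the small, medium and large branches, and the second parameter of N(0.5r, r) is read as the variance, so the standard deviation is sqrt(r).

```python
import math
import random


def scale_histogram(sizes, r):
    """Share of box sizes in the small / medium / large third of [0, r]."""
    counts = [0, 0, 0]
    for s in sizes:
        s = min(max(s, 0.0), r)              # clip to the valid size range
        counts[min(int(3 * s / r), 2)] += 1  # 0: small, 1: medium, 2: large
    return [c / len(sizes) for c in counts]


rng = random.Random(0)  # fixed seed so the run is repeatable
r, n = 416, 10000

uniform_sizes = [rng.uniform(0, r) for _ in range(n)]
normal_sizes = [rng.gauss(0.5 * r, math.sqrt(r)) for _ in range(n)]  # variance r (assumed)

uniform_hist = scale_histogram(uniform_sizes, r)
normal_hist = scale_histogram(normal_sizes, r)
print("uniform:", uniform_hist)
print("normal: ", normal_hist)
```

Under the uniform assumption each third receives roughly a third of the boxes, while under the normal assumption nearly all boxes fall into the middle third, leaving the other two branches with almost nothing to learn from.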

New framework and solutions

For label rewriting there are only two remedies: increase the resolution of the input image, or increase the size of the output feature map. The paper takes the second route.

In the original YOLOv3 the input is 8, 16 and 32 times larger than the output feature maps, and the figures above show that the share of rewritten labels is quite high. Increasing the size of the output feature map reduces that share significantly.
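
How the share of rewritten labels shrinks with a finer output grid can be estimated with a toy experiment. The sketch assumes a single anchor per cell and 50 random object centers, both of which are simplifications chosen for illustration; `rewritten_fraction` is a hypothetical helper.

```python
import random


def rewritten_fraction(centers, stride):
    """Fraction of labels lost when each grid cell can hold only one label."""
    cells = {(int(cx // stride), int(cy // stride)) for cx, cy in centers}
    return (len(centers) - len(cells)) / len(centers)


rng = random.Random(0)  # fixed seed so the run is repeatable
size = 416
centers = [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(50)]

for stride in (32, 16, 8, 4):
    print(stride, round(rewritten_fraction(centers, stride), 3))
```

Because every cell at stride 16 is a subdivision of a cell at stride 32 (and likewise for 8 and 4), the fraction can only stay the same or drop as the stride gets smaller.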

For the problems caused by k-means clustering there are two solutions:

1. Keep the k-means clustering itself unchanged, but avoid the situation where small objects are assigned to the small output feature map and large objects to the large output feature map. Concretely, first define three approximate size ranges based on the network's output layers, then set two thresholds that split the boxes into those three ranges. Next, run k-means three separate times, each time only on the bboxes inside one of the ranges instead of on the whole dataset. This guarantees that every clustering run sees only bboxes of one size range, which avoids the problem above. The drawback is just as obvious: if all objects are about the same size, almost only one output layer is ever assigned objects, and the other two scales run idle and waste resources.
2. Use a single output layer that predicts all objects. This sidesteps the k-means clustering problem, but to prevent label rewriting the output resolution has to be increased. The authors use an output at 1/4 of the input scale, which counts as high resolution and makes rewriting very unlikely.
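
The drawback of the first solution is easy to see once the threshold split is written down. The following sketch partitions boxes by their longer side; the thresholds 32 and 96 and the four example boxes are arbitrary values picked for illustration, and the split criterion itself is an assumption.

```python
def split_by_scale(boxes, t_small, t_large):
    """Partition (w, h) boxes into small / medium / large by their longer side."""
    groups = {"small": [], "medium": [], "large": []}
    for w, h in boxes:
        side = max(w, h)
        if side < t_small:
            groups["small"].append((w, h))
        elif side < t_large:
            groups["medium"].append((w, h))
        else:
            groups["large"].append((w, h))
    return groups


# Hypothetical dataset in which every object is roughly 60 px across.
boxes = [(58, 61), (60, 60), (63, 57), (59, 62)]
groups = split_by_scale(boxes, t_small=32, t_large=96)
print({name: len(group) for name, group in groups.items()})
# prints: {'small': 0, 'medium': 4, 'large': 0}
```

Each non-empty group would then be clustered on its own. Here only the medium group has any boxes, so the small and large output layers would never receive a training target.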

The architecture figure above shows that:

- In the network, the number of channels is reduced to cut the parameter count, and SE (squeeze-and-excitation) units are added to strengthen the features and improve performance.
- The biggest difference from YOLOv3 is that there is a single output layer, although multi-scale feature fusion is still used.
- The neck introduces hypercolumn aggregation with stairstep upsampling.

Left: the standard hypercolumn operation. Right: the authors' stairstep variant. Experiments show that the variant on the right works better, since it reaches a lower loss.
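
The two data flows can be compared on tiny feature maps. This is a sketch under strong simplifications: feature maps are plain 2D lists with one channel, upsampling is nearest-neighbour, and the per-level convolutions are left out. The helper names are made up for illustration.

```python
def upsample(fm, factor):
    """Nearest-neighbour upsampling of a 2D list by an integer factor."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def add(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def hypercolumn(levels):
    """Upsample every coarser level straight to the finest resolution, then sum."""
    total = levels[0]
    for i, fm in enumerate(levels[1:], start=1):
        total = add(total, upsample(fm, 2 ** i))
    return total


def stairstep(levels):
    """Start at the coarsest level; repeatedly upsample by 2 and add the next finer level."""
    acc = levels[-1]
    for fm in reversed(levels[:-1]):
        acc = add(upsample(acc, 2), fm)
    return acc


levels = [
    [[r * 8 + c for c in range(8)] for r in range(8)],         # finest map, 8x8
    [[10 * (r * 4 + c) for c in range(4)] for r in range(4)],  # 4x4
    [[100, 200], [300, 400]],                                  # 2x2
    [[1000]],                                                  # coarsest map, 1x1
]

print(hypercolumn(levels) == stairstep(levels))  # prints: True
```

With nearest-neighbour upsampling and plain addition the two orderings produce identical outputs, so the sketch only illustrates the order of operations. It says nothing about the loss difference reported in the paper, which depends on details omitted here, such as the interpolation actually used.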

With these settings the neck and head are lighter: Poly-YOLO has 37.1M parameters in total, significantly fewer than YOLOv3's 61.5M (37.1 / 61.5 ≈ 0.60). Poly-YOLO is also more accurate than YOLOv3: with 40% fewer trainable parameters, mAP improves by about 40% in relative terms. To speed things up further, the authors also designed a Lite version with only 16.5M parameters and accuracy close to YOLOv3.

Experiment and visualization

Left: the rectangular grid taken from YOLOv3. The cell that contains the center of an object's bounding box predicts that box's coordinates. Right: the grid based on circular sectors that Poly-YOLO uses to detect polygon vertices. The center of the grid coincides with the center of the object's bounding box, and each circular sector is responsible for detecting the polar coordinates of one vertex. Sectors that contain no vertex should output a confidence of zero.
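
The sector-based encoding on the right can be sketched as follows. The function `encode_polygon` is hypothetical, and several details are assumptions of this sketch rather than statements about the paper's implementation: sectors have equal angular width, angles are stored as absolute values in radians, distances are not normalized, and when several vertices share a sector the one farthest from the center is kept.

```python
import math


def encode_polygon(vertices, center, n_sectors):
    """Return one [confidence, distance, angle] triple per circular sector.

    Sectors without a vertex keep confidence 0. If several vertices fall into
    one sector, the one farthest from the center wins (assumption of this sketch).
    """
    cx, cy = center
    width = 2 * math.pi / n_sectors
    sectors = [[0.0, 0.0, 0.0] for _ in range(n_sectors)]
    for x, y in vertices:
        dist = math.hypot(x - cx, y - cy)
        angle = math.atan2(y - cy, x - cx) % (2 * math.pi)  # map to [0, 2*pi)
        idx = min(int(angle / width), n_sectors - 1)        # guard against rounding up
        if dist > sectors[idx][1]:
            sectors[idx] = [1.0, dist, angle]
    return sectors


# Hypothetical triangle around a bounding-box center at the origin.
triangle = [(2, 1), (-1, 2), (-1, -2)]
for i, (conf, dist, angle) in enumerate(encode_polygon(triangle, (0, 0), n_sectors=8)):
    print(i, conf, round(dist, 3), round(angle, 3))
```

With 8 sectors the three vertices occupy sectors 0, 2 and 5, and the remaining five sectors keep a confidence of zero, matching the behaviour described for sectors without vertices.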