YOLOv4 reading notes (with mind map)! YOLOv4: Optimal Speed and Accuracy of Object Detection

2022-06-25 20:37:00 【SophiaCV】

I was quite excited when I saw YOLOv4 today. After a long wait, YOLOv4 has finally arrived.

Paper: https://arxiv.org/pdf/2004.10934.pdf

GitHub: https://github.com/AlexeyAB/darknet

I think the authors did the right thing: the paper comes with open-source code, and nothing could be better than that!

First, here is a mind map summarizing the paper, to help you understand it better.

(The mind map and a PDF of the translated paper are available from the official account 【Computer Vision Alliance】 by replying "YOLOv4".)

Below is a translation of the paper. Some passages may not be translated well; corrections and additions are welcome.

Abstract

Many features are said to improve the accuracy of convolutional neural networks (CNNs). Combinations of these features need to be tested in practice on large datasets, and the results need theoretical justification. Some features work only for certain models and certain problems, or only on small datasets, while others, such as batch normalization and residual connections, apply to most models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial connections (CSP), Cross mini-Batch Normalization (CmBN), Self-Adversarial Training (SAT), and Mish activation. We use the following new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: **43.5% AP (65.7% AP50) on the MS COCO dataset at a real-time speed of 65 FPS on a Tesla V100.** Source code: https://github.com/AlexeyAB/darknet.

1、 Introduction

Most CNN-based object detectors are largely applicable only to recommendation systems. For example, searching for free parking spaces with city cameras is done by slow, accurate models, whereas car collision warning relies on fast, inaccurate models. Improving the accuracy of real-time object detectors makes them usable not only for hint-generating recommendation systems but also for stand-alone process management and for reducing human input. Running a real-time object detector on a conventional graphics processing unit (GPU) allows mass usage at an affordable price. The most accurate modern neural networks do not run in real time and require a large number of GPUs for training with a large mini-batch size. We address these problems by creating a CNN that runs in real time on a conventional GPU and needs only one conventional GPU for training.

The main goal of this work is to design an object detector that runs fast in production systems and is optimized for parallel computation, rather than to minimize the theoretical indicator of low computation volume (BFLOP). We hope the resulting detector is easy to train and use. As the YOLOv4 results in Figure 1 show, anyone who trains and tests on a conventional GPU can obtain real-time, high-quality, and convincing object detection results. Our contributions are summarized as follows:

We develop an efficient and powerful object detection model. It allows anyone to train a very fast and accurate object detector using a 1080 Ti or 2080 Ti GPU.
We verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials object detection methods on detector training.
We modify state-of-the-art methods to make them more efficient and more suitable for single-GPU training, including CBN [89], PAN [49], SAM [85], etc.

2、 Related work

2.1 Object detection models

A modern detector usually consists of two parts: a backbone pre-trained on ImageNet, and a head that predicts object classes and bounding boxes. For detectors running on GPU platforms, the backbone can be VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30]. For detectors running on CPU platforms, the backbone can be SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. Heads usually fall into two categories: one-stage object detectors and two-stage object detectors. The most representative two-stage object detector is the R-CNN [19] series, including fast R-CNN [18], faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. A two-stage object detector can also be made anchor-free, as in RepPoints [87]. For one-stage object detectors, the most representative models are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. In recent years, anchor-free one-stage object detectors have been developed, such as CenterNet [13], CornerNet [37, 38], and FCOS [78]. Object detectors developed in recent years often insert layers between the backbone and the head, and these layers are usually used to collect feature maps from different stages. We can call this part the neck of the object detector. A neck usually consists of several bottom-up paths and several top-down paths. Networks with this mechanism include the Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17]. In addition to the above models, some researchers focus on directly building a new backbone (DetNet [43], DetNAS [7]) or a whole new model (SpineNet [12], HitDetector [20]) for object detection.

To sum up, an ordinary object detector is composed of the following parts:

Input: image, patches, image pyramid
Backbone: VGG16 [68], ResNet-50 [26], SpineNet [12], EfficientNet-B0/B7 [75], CSPResNeXt50 [81], CSPDarknet53 [81]
Neck:
- Additional blocks: SPP [25], ASPP [5], RFB [47], SAM [85]
- Path-aggregation blocks: FPN [44], PAN [49], NAS-FPN [17], Fully-connected FPN, BiFPN [77], ASFF [48], SFAM [98]
Heads:
- Dense prediction (one-stage):
  - RPN [64], SSD [50], YOLO [61], RetinaNet [45] (anchor-based)
  - CornerNet [37], CenterNet [13], MatrixNet [60], FCOS [78] (anchor-free)
- Sparse prediction (two-stage):
  - Faster R-CNN [64], R-FCN [9], Mask R-CNN [23] (anchor-based)
  - RepPoints [87] (anchor-free)

2.2 Bag of freebies

A conventional object detector is usually trained offline. Researchers therefore like to exploit this and develop better training methods that improve the accuracy of the object detector without increasing the inference cost. We call methods that only change the training strategy or only increase the training cost a "bag of freebies". Data augmentation is the technique most often adopted by object detection methods that meets this definition. **The purpose of data augmentation is to increase the variability of the input images, so that the resulting object detection model is more robust to images obtained from different environments.** For example, photometric distortion and geometric distortion are two commonly used data augmentation methods, and they clearly benefit the object detection task. For photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotating.

The data augmentation methods mentioned above are all pixel-wise adjustments, and all original pixel information in the adjusted area is retained. In addition, some data augmentation researchers focus on simulating object occlusion, and they have achieved good results in image classification and object detection. For example, random erase [100] and CutOut [11] randomly select a rectangular region in the image and fill it with a random value or with zero. Hide-and-seek [69] and grid mask [6] randomly or evenly select multiple rectangular regions in an image and replace them with zeros. When similar concepts are applied to feature maps, we get the DropOut [71], DropConnect [80], and DropBlock [16] methods. Some researchers have also proposed performing data augmentation with multiple images together. For example, MixUp [92] multiplies two images by different coefficients and superimposes them, then adjusts the label using the same coefficients. CutMix [91] pastes a cropped region of one image onto a rectangular region of another image and adjusts the label according to the size of the mixed area. In addition to these methods, style transfer GAN [15] is also used for data augmentation, and this usage can effectively reduce the texture bias learned by a CNN.
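To make the label adjustment concrete, here is a minimal sketch of how MixUp and CutMix could mix labels as described in the paragraph above. It handles labels only, not pixels; the function names and the example numbers are assumptions chosen for illustration and do not come from the paper.

```python
def mixup_label(label_a, label_b, lam):
    """Blend two one-hot labels with the same coefficient used to blend the images."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]


def cutmix_lambda(img_w, img_h, box_w, box_h):
    """Fraction of the image that still shows the original content
    after a box_w x box_h patch from another image is pasted in."""
    return 1.0 - (box_w * box_h) / (img_w * img_h)


cat, dog = [1.0, 0.0], [0.0, 1.0]

# MixUp: the images are blended 70% / 30%, so the labels are too.
mixed = mixup_label(cat, dog, lam=0.7)

# CutMix: a 50 x 50 patch of the dog image covers 25% of a 100 x 100 cat image.
lam = cutmix_lambda(100, 100, 50, 50)
cutmix = mixup_label(cat, dog, lam)
```

In both cases the label weight follows the share of each image in the result: the blending coefficient for MixUp, the area ratio for CutMix.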

Unlike the approaches above, some other bag-of-freebies methods are dedicated to the problem that the semantic distribution in a dataset may be biased. A very important issue here is data imbalance between classes, which in two-stage object detectors is often solved by hard negative example mining [72] or online hard example mining [67]. However, example mining is not applicable to one-stage object detectors, because this kind of detector is a dense prediction architecture. Lin et al. [45] therefore proposed focal loss to deal with data imbalance between classes. Another very important issue is that it is difficult to express the degree of association between different categories with a one-hot hard representation, which is the scheme usually used for labeling. The label smoothing proposed in [73] converts hard labels into soft labels for training, which makes the model more robust. To obtain better soft labels, Islam et al. [33] introduced the concept of knowledge distillation to design a label refinement network.
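The two ideas in this paragraph can be sketched in a few lines. The code below is only an illustration: the uniform form of label smoothing and the default values `eps=0.1`, `gamma=2.0`, and `alpha=0.25` are commonly used choices and are assumptions here, not settings taken from this article.

```python
import math


def smooth_labels(one_hot, eps=0.1):
    """Move a fraction eps of the probability mass from the hard label
    to a uniform distribution over all classes."""
    k = len(one_hot)
    return [(1.0 - eps) * y + eps / k for y in one_hot]


def focal_loss(p_true, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction, where p_true is the predicted
    probability of the correct class. With gamma=0 and alpha=1 this
    reduces to ordinary cross-entropy."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


soft = smooth_labels([0.0, 1.0, 0.0, 0.0], eps=0.1)
easy = focal_loss(0.9)  # well-classified example, strongly down-weighted
hard = focal_loss(0.1)  # badly classified example, keeps most of its loss
```

The factor `(1 - p_true) ** gamma` is what lets a dense one-stage detector cope with the huge number of easy background predictions without explicit example mining.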

The last bag of freebies is the objective function of bounding box (BBox) regression. A traditional object detector usually uses mean square error (MSE) to regress directly on the center point coordinates and the height and width of the BBox, i.e. {x_center, y_center, w, h}, or on the top-left and bottom-right points, i.e. {x_top_left, y_top_left, x_bottom_right, y_bottom_right}. Anchor-based methods estimate the corresponding offsets instead, for example {x_center_offset, y_center_offset, w_offset, h_offset} and {x_top_left_offset, y_top_left_offset, x_bottom_right_offset, y_bottom_right_offset}. However, directly estimating the coordinate values of each point of the BBox treats these points as independent variables and does not consider the integrity of the object itself. To handle this better, some researchers recently proposed IoU loss [90], which takes into account the coverage of the predicted BBox area and the ground-truth BBox area. Computing the IoU loss involves all four coordinate points of the BBox at once: the IoU with the ground truth is computed, and the results are combined into a single quantity. Because IoU is a scale-invariant representation, it solves the problem that the l1 or l2 loss of {x, y, w, h} computed by traditional methods grows with scale. Researchers have since continued to improve IoU loss. For example, GIoU loss [65] includes the shape and orientation of the object in addition to the coverage area. Its authors propose finding the smallest BBox that covers both the predicted BBox and the ground-truth BBox, and using this BBox as the denominator in place of the one originally used in IoU loss. DIoU loss [99] additionally considers the distance between the centers of the objects, and CIoU loss [99] simultaneously considers the overlapping area, the distance between center points, and the aspect ratio. CIoU achieves better convergence speed and accuracy on the BBox regression problem.
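The IoU variants discussed above can be compared side by side in a short sketch. The formulas follow the usual published definitions of GIoU, DIoU, and CIoU as I understand them; the function name, the (x1, y1, x2, y2) box format, and the assumption that both boxes have positive width and height are choices made for this illustration. Each loss is one minus the corresponding value.

```python
import math


def iou_family(a, b):
    """Return (IoU, GIoU, DIoU, CIoU) for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Overlap area and union area.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # GIoU: penalize the empty part of the smallest enclosing box.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    giou = iou - (cw * ch - union) / (cw * ch)

    # DIoU: penalize the squared center distance, normalized by the
    # squared diagonal of the enclosing box.
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    diou = iou - rho2 / (cw ** 2 + ch ** 2)

    # CIoU: additionally penalize a mismatch in aspect ratio.
    v = (4.0 / math.pi ** 2) * (
        math.atan((bx2 - bx1) / (by2 - by1)) - math.atan((ax2 - ax1) / (ay2 - ay1))
    ) ** 2
    alpha = v / (1.0 - iou + v) if v > 0 else 0.0
    ciou = diou - alpha * v
    return iou, giou, diou, ciou


# Two 2 x 2 boxes that overlap in a 1 x 1 corner.
iou, giou, diou, ciou = iou_family((0, 0, 2, 2), (1, 1, 3, 3))
```

For boxes with the same aspect ratio, as in this example, the CIoU value equals the DIoU value; the extra term only matters when the shapes differ.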

2.3 Bag of specials

We call plug-in modules and post-processing methods that add only a small amount of inference cost but significantly improve the accuracy of object detection a "bag of specials". Generally speaking, the plug-in modules enhance certain attributes of a model, such as enlarging the receptive field, introducing an attention mechanism, or strengthening feature integration capability, while post-processing is a method for screening the model's prediction results.

Common modules that can be used to enlarge the receptive field are SPP [25], ASPP [5], and RFB [47]. The SPP module originates from Spatial Pyramid Matching (SPM) [39]. SPM's original method splits the feature map into several d × d equal blocks, where d can be {1, 2, 3, ...}, forming a spatial pyramid, and then extracts bag-of-words features. SPP integrates SPM into a CNN and uses max-pooling instead of the bag-of-words operation. Because the SPP module proposed by He et al. [25] outputs a one-dimensional feature vector, it cannot be applied in a Fully Convolutional Network (FCN). Thus, in the design of YOLOv3 [63], Redmon and Farhadi improved the SPP module into a concatenation of max-pooling outputs with kernel size k × k, where k = {1, 5, 9, 13}, and stride 1. Under this design, a relatively large k × k max-pooling effectively increases the receptive field of the backbone features. After adding the improved SPP module, YOLOv3-608 raises AP50 by 2.7% on the MS COCO object detection task at the cost of 0.5% extra computation. The ASPP [5] module differs from the improved SPP module mainly in that the k × k max-pooling with stride 1 is replaced by several 3 × 3 dilated convolutions with dilation ratio k and stride 1. The RFB module uses several dilated convolutions with k × k kernels, dilation ratio k, and stride 1 to obtain more comprehensive spatial coverage than ASPP. RFB [47] costs only 7% extra inference time to increase the AP50 of SSD on MS COCO by 5.7%.
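The improved SPP block is easy to picture in code. The sketch below works on a single-channel feature map stored as nested lists, which is a simplification made for illustration; in a real network each pooled map has as many channels as the input and the maps are concatenated along the channel axis. The function names are assumptions.

```python
def max_pool_same(fmap, k):
    """k x k max pooling with stride 1. Positions outside the map are
    ignored, so the output has the same size as the input."""
    h, w, r = len(fmap), len(fmap[0]), k // 2
    return [
        [
            max(
                fmap[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def spp_block(fmap, kernels=(1, 5, 9, 13)):
    """Stack the pooled maps as channels: one output channel per kernel size."""
    return [max_pool_same(fmap, k) for k in kernels]


# A 7 x 7 map that is zero except for a single peak in the center.
fmap = [[0.0] * 7 for _ in range(7)]
fmap[3][3] = 1.0
channels = spp_block(fmap)
```

With k = 1 the map passes through unchanged, while with k = 13 every position of this small map already sees the central peak, which is the receptive-field effect the text describes.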

The attention modules often used in object detection are mainly divided into channel-wise attention and point-wise attention, and the representatives of these two attention models are Squeeze-and-Excitation (SE) [29] and the Spatial Attention Module (SAM) [85], respectively. Although the SE module can improve top-1 accuracy by 1% on the ImageNet image classification task, on a GPU it usually increases inference time by about 10%, so it is more appropriate for mobile devices. SAM, by contrast, needs only 0.1% extra computation and improves the top-1 accuracy of ResNet50-SE by 0.5% on the ImageNet image classification task. Best of all, it does not affect inference speed on the GPU at all.

In terms of feature integration, the early practice was to use skip connections [51] or hyper-columns [22] to integrate low-level physical features into high-level semantic features. As multi-scale prediction methods such as FPN became popular, many lightweight modules that integrate different feature pyramids were proposed. Modules of this type include SFAM [98], ASFF [48], and BiFPN [77]. The main idea of SFAM is to use the SE module to re-weight multi-scale concatenated feature maps at the channel level. ASFF uses softmax for point-wise level re-weighting and then adds feature maps of different scales. BiFPN uses multi-input weighted residual connections to perform scale-wise level re-weighting and then adds feature maps of different scales.

In deep learning research, some people focus on finding a good activation function. A good activation function lets the gradient propagate more efficiently without causing too much extra computational cost. In 2010, Nair and Hinton [56] proposed ReLU, which largely solves the vanishing gradient problem frequently encountered with the traditional tanh and sigmoid activation functions. Subsequently, LReLU [54], PReLU [24], ReLU6 [28], the Scaled Exponential Linear Unit (SELU) [35], Swish [59], hard-Swish [27], and Mish [55], which also address the vanishing gradient problem, were proposed. The main purpose of LReLU and PReLU is to solve the problem that the gradient of ReLU is zero when the output is less than zero. ReLU6 and hard-Swish are designed specifically for quantization networks. SELU was proposed to self-normalize a neural network. One thing to note is that both Swish and Mish are continuously differentiable activation functions.
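For reference, the activation functions compared later in the experiments can be written as scalar functions. This is a minimal sketch using the standard definitions (Mish as x · tanh(softplus(x)) and Swish as x · sigmoid(x)); the leaky slope of 0.1 is an assumption for illustration.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish: x * tanh(softplus(x)). Smooth, with small negative outputs allowed."""
    return x * math.tanh(softplus(x))


def swish(x):
    """Swish: x * sigmoid(x). The sigmoid is written via tanh to avoid overflow."""
    return x * 0.5 * (1.0 + math.tanh(0.5 * x))


def leaky_relu(x, slope=0.1):
    """Leaky ReLU: a small non-zero slope for negative inputs."""
    return x if x > 0 else slope * x


samples = [(x, leaky_relu(x), swish(x), mish(x)) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```

Unlike Leaky ReLU, which has a kink at zero, Swish and Mish are smooth everywhere, which is the "continuously differentiable" property mentioned above.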

The post-processing method commonly used in deep-learning-based object detection is NMS, which filters out BBoxes that badly predict the same object and retains only the candidate BBoxes with a higher response. The way NMS is improved is consistent with the way an objective function is optimized. The originally proposed NMS does not consider context information, so Girshick et al. [19] added the classification confidence score in R-CNN as a reference and performed greedy NMS in order of confidence score, from high to low. Soft NMS [1] considers the problem that object occlusion may degrade the confidence score in greedy NMS based on the IoU score. The developers of DIoU NMS [99] add center point distance information to the BBox screening process on the basis of soft NMS. It is worth mentioning that, because none of the above post-processing methods directly refers to the captured image features, post-processing is no longer required in the subsequent development of anchor-free methods.
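Greedy NMS as described above fits in a few lines. The sketch below is a plain illustration of the greedy procedure, not the implementation used by YOLOv4; the function names, the (x1, y1, x2, y2) box format, and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Visit boxes from the highest score to the lowest and keep a box only
    if it does not overlap an already kept box by more than the threshold.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)  # the second box overlaps the first and is dropped
```

DIoU NMS follows the same loop but replaces the plain IoU test with one that also accounts for the distance between box centers.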

3、 Methodology

The basic aim is fast operating speed of the neural network in production systems and optimization for parallel computation, rather than the theoretical indicator of low computation volume (BFLOP). We present two options for real-time neural networks:

For GPUs, we use a small number of groups (1-8) in the convolutional layers: CSPResNeXt50 / CSPDarknet53
For VPUs, we use grouped convolution but refrain from using Squeeze-and-Excitation (SE) blocks. Specifically, this includes the following models: EfficientNet-lite / MixNet [76] / GhostNet [21] / MobileNetV3

3.1 Selection of architecture

Our objective is to find the optimal balance among the input network resolution, the number of convolutional layers, the number of parameters (filter_size^2 * filters * channels / groups), and the number of layer outputs (filters). For example, numerous studies show that CSPResNext50 is considerably better than CSPDarknet53 at object classification on the ILSVRC2012 (ImageNet) dataset. Conversely, CSPDarknet53 is better than CSPResNext50 at detecting objects on the MS COCO dataset.
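The parameter formula in this paragraph is easy to check with a small helper. This is only an illustration: the function name and the example layer sizes are assumptions, biases are not counted, and the number of channels is assumed to be divisible by the number of groups.

```python
def conv_weights(filter_size, filters, channels, groups=1):
    """Weights in one convolutional layer:
    filter_size^2 * filters * channels / groups."""
    return filter_size ** 2 * filters * channels // groups


# A 3 x 3 convolution from 128 input channels to 256 filters.
dense = conv_weights(3, 256, 128)              # ordinary convolution
grouped = conv_weights(3, 256, 128, groups=8)  # grouped convolution, 8 groups
```

Using 8 groups divides the weight count of the layer by 8, which is why the number of groups is part of the trade-off described above.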

The next objective is to select additional blocks for increasing the receptive field, and the best method of aggregating parameters from different backbone levels for different detector levels, e.g. FPN, PAN, ASFF, BiFPN.

A reference model that is optimal for classification is not always optimal for a detector. In contrast to a classifier, a detector requires the following:

Higher network input size (resolution) - for detecting multiple small objects
More layers - for a larger receptive field to cover the increased input size
More parameters - for greater capacity to detect multiple objects of different sizes in a single image

Hypothetically, we can assume that a model with a larger receptive field (a larger number of 3×3 convolutional layers) and a larger number of parameters should be selected as the backbone. **Table 1 shows the information for CSPResNeXt50, CSPDarknet53, and EfficientNet B3.** CSPResNext50 contains only 16 3×3 convolutional layers, a 425×425 receptive field, and 20.6M parameters, while CSPDarknet53 contains 29 3×3 convolutional layers, a 725×725 receptive field, and 27.6M parameters. This theoretical justification, together with our numerous experiments, shows that CSPDarknet53 is the better of the two as the backbone of a detector.

The influence of receptive fields of different sizes is summarized as follows:

Up to the object size - allows viewing the entire object
Up to the network size - allows viewing the context around the object
Exceeding the network size - increases the number of connections between an image point and the final activation

We add the SPP block on top of CSPDarknet53, since it significantly increases the receptive field, separates out the most significant context features, and causes almost no reduction in network operation speed. We use PANet, instead of the FPN used in YOLOv3, as the method of aggregating parameters from different backbone levels for different detector levels.

Finally, we choose the CSPDarknet53 backbone, the SPP additional module, the PANet path-aggregation neck, and the YOLOv3 (anchor-based) head as the architecture of YOLOv4.

In the future, we plan to significantly expand the content of the Bag of Freebies (BoF) for the detector, which in theory can address some problems and increase detector accuracy, and to check the influence of each feature in turn by experiment.

We do not use Cross-GPU Batch Normalization (CGBN or SyncBN) or expensive specialized devices. This allows anyone to reproduce our state-of-the-art results on a conventional graphics processor, e.g. GTX 1080Ti or RTX 2080Ti.

3.2 Selection of BoF and BoS

To improve object detection training, a CNN usually uses the following:

Activations: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, or Mish
Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU
Data augmentation: CutOut, MixUp, CutMix
Regularization method: DropOut, DropPath [36], Spatial DropOut [79], or DropBlock
Normalization of the network activations by their mean and variance: Batch Normalization (BN) [32], Cross-GPU Batch Normalization (CGBN or SyncBN) [93], Filter Response Normalization (FRN) [70], or Cross-Iteration Batch Normalization (CBN) [89]
Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections, or Cross stage partial connections (CSP)

As for the training activation function, since PReLU and SELU are more difficult to train and ReLU6 is designed specifically for quantization networks, we remove these activation functions from the candidate list. For regularization, the authors of DropBlock compared their method with other methods in detail, and their method came out well ahead. We therefore did not hesitate to choose DropBlock as our regularization method. As for the normalization method, since we focus on a training strategy that uses only one GPU, SyncBN is not considered.

3.3 Additional improvements

To make the designed detector more suitable for training on a single GPU, we made the following additional design choices and improvements:

We introduce a new data augmentation method, Mosaic, and Self-Adversarial Training (SAT)
We select optimal hyperparameters by applying genetic algorithms
We modify some existing methods to make our design suitable for efficient training and detection: modified SAM, modified PAN, and Cross mini-Batch Normalization (CmBN)

Mosaic is a new data augmentation method that mixes 4 training images. Four different contexts are therefore mixed, whereas CutMix mixes only 2 input images.

This allows objects to be detected outside their normal context. In addition, batch normalization computes activation statistics from 4 different images on each layer, which significantly reduces the need for a large mini-batch size.
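The core idea of Mosaic, combining four images and their boxes into one training sample, can be sketched as follows. This is a deliberately simplified illustration: it tiles four equally sized images into a fixed 2 x 2 grid, whereas a practical implementation would typically also pick a random split point and rescale and crop the images. The function name and data layout are assumptions.

```python
def mosaic(images, boxes_per_image):
    """Tile four equally sized images (lists of pixel rows) into a 2 x 2 grid
    and shift each image's boxes (x1, y1, x2, y2) into the new frame."""
    h, w = len(images[0]), len(images[0][0])
    top = [images[0][y] + images[1][y] for y in range(h)]
    bottom = [images[2][y] + images[3][y] for y in range(h)]
    # Offsets of the four tiles: top-left, top-right, bottom-left, bottom-right.
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]
    boxes = [
        (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
        for (dx, dy), image_boxes in zip(offsets, boxes_per_image)
        for (x1, y1, x2, y2) in image_boxes
    ]
    return top + bottom, boxes


# Four 2 x 2 "images" filled with the values 1..4, each with one box.
images = [[[v] * 2 for _ in range(2)] for v in (1, 2, 3, 4)]
canvas, boxes = mosaic(images, [[(0, 0, 1, 1)]] * 4)
```

Every mosaic sample then contains objects from four different contexts, so even a small mini-batch exposes batch normalization to statistics from several images.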

Self-Adversarial Training (SAT) is also a new data augmentation technique, and it operates in 2 forward-backward stages. In the first stage, the neural network alters the original image instead of the network weights. In this way the neural network performs an adversarial attack on itself, altering the original image to create the deception that there is no desired object in the image. In the second stage, the neural network is trained to detect objects on this modified image in the normal way.

CmBN is a modified version of CBN, as shown in Figure 4, and is defined as Cross mini-Batch Normalization (CmBN). It collects statistics only between mini-batches within a single batch.

We modify SAM from spatial-wise attention to point-wise attention, and replace the shortcut connection of PAN with concatenation, as shown in Figure 5 and Figure 6, respectively.

3.4 YOLOv4

In this section, we describe YOLOv4 in detail.

YOLOv4 consists of: backbone CSPDarknet53; neck SPP and PAN; head YOLOv3.

YOLOv4 Use ：

4、Experiments

We test the influence of different training improvement techniques on classifier accuracy on the ImageNet (ILSVRC 2012 val) dataset, and then on detector accuracy on the MS COCO (test-dev 2017) dataset.

4.1 Experimental setup

In the ImageNet image classification experiments, the default hyperparameters are as follows: the number of training steps is 8,000,000; the batch size and the mini-batch size are 128 and 32, respectively; a polynomial decay learning rate schedule is used with an initial learning rate of 0.1; the number of warm-up steps is 1000; the momentum and weight decay are set to 0.9 and 0.005, respectively. All of our BoS experiments use the same hyperparameters as the default setting, and in the BoF experiments we add an additional 50% of training steps. In the BoF experiments, we verify MixUp, CutMix, Mosaic, blurring data augmentation, and label smoothing regularization. In the BoS experiments, we compare the effects of the LReLU, Swish, and Mish activation functions. All experiments are trained on a 1080 Ti or 2080 Ti GPU.
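A polynomial decay schedule with warm-up, using the numbers quoted above, might look like the sketch below. The text does not give the polynomial power or the shape of the warm-up ramp, so `power=4.0` and the polynomial ramp during warm-up are assumptions made for illustration, as is the function name.

```python
def poly_lr(step, base_lr=0.1, max_steps=8_000_000, warmup_steps=1000, power=4.0):
    """Learning rate at a given training step.

    During warm-up the rate ramps up from 0 towards base_lr; afterwards it
    decays polynomially and reaches 0 at max_steps. The value of power and
    the ramp shape are assumptions, not taken from the article."""
    if step < warmup_steps:
        return base_lr * (step / warmup_steps) ** power
    return base_lr * (1.0 - step / max_steps) ** power


# Sample the schedule at a few points of the 8,000,000-step run.
schedule = [(s, poly_lr(s)) for s in (0, 500, 1000, 4_000_000, 8_000_000)]
```

The warm-up keeps the first updates small while the statistics of a freshly initialized network are still unstable.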

In the MS COCO object detection experiments, the default hyperparameters are as follows: the number of training steps is 500,500; a step decay learning rate schedule is used with an initial learning rate of 0.01, multiplied by a factor of 0.1 at step 400,000 and again at step 450,000; the momentum and weight decay are set to 0.9 and 0.0005, respectively. All architectures use a single GPU to perform multi-scale training with a batch size of 64, while the mini-batch size is 8 or 4 depending on the architecture and the GPU memory limit. Except for the experiments that use a genetic algorithm for hyperparameter search, all experiments use the default setting. The genetic algorithm used YOLOv3-SPP trained with GIoU loss and searched for 300 epochs on the min-val 5k set. For the genetic algorithm experiments, we adopt the searched learning rate 0.00261, momentum 0.949, IoU threshold for assigning ground truth 0.213, and loss normalizer 0.07. We verified a large number of BoF methods, including grid sensitivity elimination, Mosaic data augmentation, IoU threshold, genetic algorithm, class label smoothing, cross mini-batch normalization, self-adversarial training, cosine annealing scheduler, dynamic mini-batch size, DropBlock, optimized anchors, and different kinds of IoU losses. We also conducted experiments on various BoS methods, including Mish, SPP, SAM, RFB, BiFPN, and Gaussian YOLO [8]. For all experiments we use only one GPU for training, so techniques such as SyncBN that optimize for multiple GPUs are not used.
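The step decay schedule and the relation between batch size and mini-batch size described above can be written down directly. The sketch uses the numbers from the text; the function names are assumptions, and reading "batch size 64 with mini-batch size 8" as accumulating 8 mini-batches per weight update is my interpretation rather than something the article states.

```python
def step_lr(step, base_lr=0.01, milestones=(400_000, 450_000), factor=0.1):
    """Step decay: multiply the rate by factor at each milestone that has passed."""
    lr = base_lr
    for milestone in milestones:
        if step >= milestone:
            lr *= factor
    return lr


def mini_batches_per_update(batch_size=64, mini_batch_size=8):
    """How many mini-batches fit into one batch when the batch is processed
    in smaller pieces to fit into GPU memory (an interpretation of the text)."""
    return batch_size // mini_batch_size


rates = [step_lr(s) for s in (0, 399_999, 400_000, 450_000, 500_500)]
pieces = mini_batches_per_update(64, 8)
```

With a mini-batch size of 4 instead of 8, the same batch of 64 is split into 16 pieces, which is how a single GPU with limited memory can still train with the full batch size.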

4.2 Influence of different features on Classifier training

First, we study the influence of different features on classifier training; specifically, the influence of class label smoothing, the influence of different data augmentation techniques (bilateral blurring, MixUp, CutMix, and Mosaic), as shown in Figure 7, and the influence of different activations, such as Leaky-ReLU (the default), Swish, and Mish.

In our experiments, as shown in Table 2, classifier accuracy improves when features such as CutMix and Mosaic data augmentation, class label smoothing, and Mish activation are introduced. As a result, our BoF-backbone (Bag of Freebies) for classifier training includes CutMix and Mosaic data augmentation and class label smoothing. In addition, we use Mish activation as a complementary option, as shown in Table 2 and Table 3.

4.3 Influence of different features on Detector training

Further study concerns the influence of different Bag-of-Freebies (BoF-detector) methods on detector training accuracy, as shown in Table 4. We significantly expand the BoF list by studying different features that increase detector accuracy without affecting FPS.

Further study concerns the influence of different Bag-of-Specials (BoS-detector) methods on detector training accuracy, including PAN, RFB, SAM, Gaussian YOLO (G), and ASFF, as shown in Table 5. In our experiments, the detector performs best when using SPP, PAN, and SAM.

4.4 Influence of different backbones and pretrained weightings on Detector training

Further, we study the influence of different backbone models on detector accuracy, as shown in Table 6. Note that the model with the best classification accuracy is not always the best in terms of detector accuracy.

First, although CSPResNeXt-50 models trained with different features have higher classification accuracy than CSPDarknet53 models, the CSPDarknet53 model shows higher accuracy in object detection.

Second, using BoF and Mish for CSPResNeXt50 classifier training increases its classification accuracy, but applying these pre-trained weights to detector training reduces detector accuracy. However, using BoF and Mish for CSPDarknet53 classifier training increases the accuracy of both the classifier and the detector that uses the classifier's pre-trained weights. The net result is that the CSPDarknet53 backbone is more suitable for the detector than CSPResNeXt50.

We observe that, owing to various improvements, the CSPDarknet53 model shows a greater ability to increase detector accuracy.

4.5 Influence of different mini-batch size on Detector training

Finally, we analyze the results obtained with models trained with different mini-batch sizes; the results are shown in Table 7. From these results we find that, after adding the BoF and BoS training strategies, the mini-batch size has almost no effect on detector performance. This shows that, once BoF and BoS are introduced, expensive GPUs are no longer needed for training. In other words, anyone can train an excellent detector using only a conventional GPU.

5、Results

A comparison with other state-of-the-art object detectors is shown in Figure 8. YOLOv4 is superior to both the fastest and the most accurate detectors in terms of speed and accuracy.

Since different methods use GPUs of different architectures to verify inference time, we run YOLOv4 on commonly used GPUs of the Maxwell, Pascal, and Volta architectures and compare it with other state-of-the-art methods. Table 8 lists the frame rate comparison on Maxwell GPUs, which can be a GTX Titan X (Maxwell) or Tesla M40 GPU. Table 9 lists the frame rate comparison on Pascal GPUs, which can be a Titan X (Pascal), Titan Xp, GTX 1080 Ti, or Tesla P100 GPU. Table 10 lists the frame rate comparison on Volta GPUs, which can be a Titan Volta or Tesla V100 GPU.

6、Conclusions

We offer a state-of-the-art detector that is faster (FPS) and more accurate (MS COCO AP50...95 and AP50) than all available alternative detectors. The detector can be trained and used on a conventional GPU with 8-16 GB of VRAM, which makes its broad use possible. The original concept of anchor-based detectors has proven its viability. We have verified a large number of features and selected some of them to improve the accuracy of both the classifier and the detector. These features can serve as best practice for future research and development.

7、Acknowledgements

The authors wish to thank Glenn Jocher for the ideas of Mosaic data augmentation, the selection of hyperparameters by using genetic algorithms, and the solution to the grid sensitivity problem: https://github.com/ultralytics/yolov3.

Copyright notice
This article was written by [SophiaCV]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202181342288606.html