
Introduction to RCNN, Fast RCNN and Faster RCNN

2022-06-24 08:25:00 Midnight rain

RCNN, Fast RCNN, Faster RCNN

RCNN

 RCNN was the first algorithm to introduce ConvNets into the field of object detection. Unlike image classification, object detection must not only classify an image but also localize the objects in it. More formally: given an input image, a qualified detector should box every valid target (of the categories seen during training) and classify each one correctly.
 As a detector, RCNN must therefore complete both tasks, box selection and classification. The former is handled by selective search plus a neural-network position refinement; the latter by ConvNet feature extraction followed by SVMs. Training proceeds as follows:

  1. Candidate regions are first obtained from the input image with selective search, a hierarchical segmentation method from traditional image processing: it generates a set of initial regions and then merges them by cues such as color and texture to obtain a final region segmentation of the image.
  2. Network training. Because labeled detection datasets are small, the author uses transfer learning: AlexNet is first pre-trained for image classification on ILSVRC2012, then its final classification layer is removed for fine-tuning. Fine-tuning is a multi-class task: the input is a candidate region from step 1, and the output is whether that region is some object class or background (IoU > 0.5 with a ground-truth box counts as that object). Since candidate regions differ in size, they must be warped to a common size; the author compared direct resizing, padding then cropping, and cropping then padding [1], and the first works best.
  3. After fine-tuning, the final classification layer is removed again, and the output of the last FC layer serves as the image representation extracted by the network. For each category, the CNN features of candidate regions are used to train a class-specific SVM that performs the final classification. Note that the sample definitions here differ from those used for fine-tuning: negatives require IoU < 0.3, and positives must contain the complete object. (A separate SVM is needed because fine-tuning uses a loose positive definition to get enough training samples, while classification keeps the strict definition, so extra SVMs must be trained.)
  4. Because the regions from selective search seldom match the ground-truth boxes exactly, the author trains a class-specific regression model to fine-tune them: it takes the AlexNet pool5 features of same-class samples as input and outputs translation and scaling coefficients for the candidate box, yielding a refined region.
    The overall pipeline is as follows:
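The box refinement in step 4 can be sketched as a transform/refine pair. This is a minimal illustration, with boxes given as (center x, center y, width, height) and function names of my own choosing, not the paper's:

```python
import numpy as np

def bbox_transform(proposal, gt):
    """Regression targets (t_x, t_y, t_w, t_h) for a proposal P and
    ground-truth box G, both given as (center x, center y, w, h)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)])

def bbox_refine(proposal, t):
    """Apply predicted coefficients: translate the center, scale the size."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return np.array([px + tx * pw, py + ty * ph,
                     pw * np.exp(tw), ph * np.exp(th)])
```

Refining a proposal with its own targets recovers the ground-truth box exactly; the per-class regressor is trained to predict these targets from pool5 features.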

Fast RCNN

  RCNN clearly has problems, for example that training and testing are very time-consuming: a feature map must be computed once for every candidate region in an image, and an image typically yields about 2k candidates, which makes the computation heavy.

SPPNet

  To solve this problem, Kaiming He et al. proposed SPPNet. SPPNet takes the whole image as input and computes a single feature map; each candidate region of the original image corresponds to a part of that map and can be located by coordinate mapping. SPPNet compares with RCNN as follows:

The remaining problem is the varying sizes of the candidate regions. RCNN handles this by warping each region; SPPNet solves it with spatial pyramid pooling. The fixed-input-size constraint of a CNN comes not from the convolution layers, which treat any input alike, but from the fully connected layers, whose input dimension must match the layer weights. SPPNet therefore replaces the last pooling layer of the original network with a spatial pyramid pooling layer, which converts a feature map of any size into a fixed-length feature vector [2]. Feature-map crops of differently sized candidate regions are thus mapped to vectors of the same size for the subsequent decision. In other respects, SPPNet does not improve on RCNN.
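A spatial pyramid pooling layer can be sketched with PyTorch's adaptive pooling. This is an illustrative stand-in for the paper's layer, assuming max pooling over 4×4, 2×2 and 1×1 grids:

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feat, levels=(4, 2, 1)):
    """Pool a (C, H, W) feature map over several fixed grids and concatenate.
    The output length C * sum(n*n) is independent of H and W."""
    c = feat.shape[0]
    parts = [F.adaptive_max_pool2d(feat.unsqueeze(0), n).reshape(c * n * n)
             for n in levels]
    return torch.cat(parts)
```

Feature maps of any spatial size map to the same 21C-length vector, which is what lets the FC layers accept images of arbitrary size.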

Fast RCNN

  Besides, the RCNN pipeline is split into three separately trained parts: the ConvNet feature extractor, the SVM region classifiers, and the bounding-box regressor that refines candidate regions. The whole is disjoint and cannot be trained end to end. Fast RCNN emerged to fix this: building on SPPNet, it integrates the three components and jointly learns feature extraction, region classification, and candidate-region refinement. Its overall framework is as follows:

Like SPPNet, its input is a single image together with the candidate regions on it. RoI projection maps each candidate region onto the corresponding area of the feature map, i.e., the feature-map crop of that region. Crops of different sizes then pass through an RoI pooling layer to produce fixed-length feature vectors; the RoI pooling layer is a simplified spatial pyramid pooling layer that, instead of grids at several scales, uses a single grid (e.g. 7×7). The feature vectors then go through two fully connected heads, one for region classification and one for refinement-coefficient regression. For each input RoI, the training loss is:
$$L(p,u,t^u,v)=L_{cls}(p,u)+\lambda[u\geq 1]L_{loc}(t^u,v)$$
where $u$ is the region's true class, $p$ is the probability distribution output by the classification head, $t^u$ is the translation/scaling coefficients output by the regression head for class $u$, and $v$ is their ground-truth values. $L_{cls}$ is the cross-entropy loss and $L_{loc}$ the smooth L1 loss.
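The loss above can be sketched in PyTorch. This is a minimal illustration assuming class-specific box regression (the regression head predicts 4 coefficients per class and only the true class's set is penalized); the function name is my own:

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(cls_scores, bbox_pred, labels, bbox_targets, lam=1.0):
    """L = L_cls(p, u) + lambda * [u >= 1] * L_loc(t^u, v), averaged over RoIs."""
    loss_cls = F.cross_entropy(cls_scores, labels)    # L_cls: cross entropy
    fg = labels >= 1                                  # [u >= 1]: skip background
    n_fg = int(fg.sum())
    if n_fg > 0:
        # pick the 4 coefficients predicted for each foreground RoI's true class
        per_class = bbox_pred[fg].reshape(n_fg, -1, 4)
        pred = per_class[torch.arange(n_fg), labels[fg]]
        loss_loc = F.smooth_l1_loss(pred, bbox_targets[fg])  # L_loc: smooth L1
    else:
        loss_loc = cls_scores.new_zeros(())
    return loss_cls + lam * loss_loc
```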
  In addition, the author also uses some other trick, The author thinks that SPPnet Do not update the... Before the space pyramid pool layer Conv Layer weight , But only for the subsequent full connection layer finetune The reason is that the input of the same batch contains... From many different pictures ROI, and ROI It may contain a large receptive field of the original picture , Every time it is calculated in the back propagation ROI These areas need to be calculated , Resulting in a large amount of calculation . So when the author trains , The same batch contains only 2 A picture , Take... From each picture 128 individual ROI, In the forward and backward propagation of the same picture ROI Resources can be shared , It also speeds up the calculation . most important of all , The author proved through experiments that CONV It is necessary to update the parameters of the layer Of , Especially for those with large parameters ConvNet, Can be better improved MAP, But not all CONV It is necessary to train at all levels , Compared with the previous layers, fine adjustment of parameters is of little significance .
  To further speed up computation, the author applies truncated SVD to the fully connected layers: the original weight matrix $W$ is decomposed into the two factors $U$ and $\Sigma_t V^T$ [3]. With input size $v$ and output size $u$, the multiplication cost drops from $uv$ to $t(u+v)$, which is a saving when $t \ll \min(u,v)$.
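The factorization can be sketched in NumPy: the $u\times v$ weight is replaced by two thin matrices, so one matrix-vector product of cost $uv$ becomes two of total cost $t(u+v)$ (the function name is illustrative):

```python
import numpy as np

def truncated_svd_fc(W, t):
    """Factor a (u, v) FC weight as A @ B with A: (u, t), B: (t, v),
    keeping the t largest singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :t], s[:t, None] * Vt[:t]
```

In a network this amounts to splitting one FC layer into two smaller ones: v→t without bias, then t→u carrying the original bias.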
  Overall, Fast RCNN's main contributions are being fast and accurate: it realizes end-to-end training of the detection network, speeds up computation with truncated SVD, and demonstrates that updating the conv-layer parameters is effective.

Faster RCNN

 Fast RCNN integrates every step after proposal extraction, but the initial candidate regions still come from selective search, which is very time-consuming and a major obstacle to real-time detection. The authors therefore propose the Region Proposal Network (RPN), which uses a convolutional network to select candidate regions, and share the weights of the RPN's front end with the Fast RCNN detection network to further speed up the test phase. Thus Faster RCNN was born.

RPN

 The RPN computes initial candidate regions with a deep convolutional network. Its input is an image; its outputs are the coordinates of many rectangular regions on the image and, for each, the probability that it contains a valid object. More specifically, the RPN consists of a front-end feature extractor (shared with Fast RCNN) and an RPN-specific proposal part whose input is the image's feature map. Its structure is as follows:

  First, the RPN assumes that each position on the input feature map corresponds to $k$ candidate regions: by inverting the downsampling, a point on the feature map maps back to a location in the original image, around which rectangles can be drawn. To capture multiple scales, the author uses boxes of different sizes and different aspect ratios, so that even when an object is deformed or rescaled in the image, some candidate region still frames it well. These $k$ candidate regions are called anchors and are specified by a scale parameter and an aspect-ratio parameter: the former sets the box size in pixels, the latter its width-to-height ratio, and together with the center coordinates these four numbers determine a rectangle. The figure below shows candidate regions of different scales selected in the original image for three feature-map points [4].
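Anchor generation at a single feature-map position can be sketched as follows, assuming the paper's 3 scales × 3 aspect ratios with equal-area boxes per scale (the helper name and exact parameterization are illustrative):

```python
import numpy as np

def make_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """k = len(scales) * len(ratios) anchors centered at the origin,
    returned as (x1, y1, x2, y2); ratio r is height / width."""
    anchors = []
    for s in scales:
        area = float(base * s) ** 2   # all ratios at one scale share this area
        for r in ratios:
            w = np.sqrt(area / r)
            h = w * r
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)
```

Shifting these $k$ boxes to every feature-map position (scaled by the feature stride) yields the full anchor grid over the image.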

  Obtaining the initial candidate regions on the feature map is not enough; we must also judge whether these boxes are reasonable and how far they deviate. The author therefore applies a 3×3 convolution to the input feature map to obtain feature vectors, where each element can be regarded as a feature representation of the area around one box center. The features then feed two sibling 1×1 convolutions: a classification head with $2k$ output channels and a regression head with $4k$ output channels. At each position, the classification head outputs, for each of the $k$ anchors centered there, the probability that it does or does not contain a valid object (two scores per anchor, normalized by a softmax), and the regression head outputs the 4 position-adjustment parameters for each of the $k$ boxes.
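The two-headed structure can be sketched as a small PyTorch module; this is a hedged illustration, not the reference implementation:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """3x3 conv over the shared feature map, then two sibling 1x1 convs:
    an objectness head with 2k channels and a box head with 4k channels."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, 2 * k, 1)  # object / not-object scores
        self.reg = nn.Conv2d(in_channels, 4 * k, 1)  # 4 box deltas per anchor

    def forward(self, x):
        h = torch.relu(self.conv(x))
        return self.cls(h), self.reg(h)
```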
  After the above steps, we have, for each feature-map position, $k$ candidate regions and the probability that each contains a target object (object vs. no object only; finer classification is left to the detection network). The author then filters these candidates further: cross-boundary boxes are removed during training (at test time they are clipped rather than deleted), and NMS removes overlapping boxes, which also speeds up the network's convergence. The RPN loss is as follows:
$$L(\{p_i\},\{t_i\})=\frac{1}{N_{cls}}\sum_i L_{cls}(p_i,p_i^*)+\lambda\frac{1}{N_{reg}}\sum_i p_i^* L_{reg}(t_i,t_i^*)$$
The first term is the cross-entropy loss of the classification head; the second is the regression head's smooth L1 loss, the same loss used for candidate-region refinement in Fast RCNN.
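The NMS step mentioned above can be sketched in plain NumPy (greedy suppression; in practice one would use a library routine such as torchvision's):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-maximum suppression: keep boxes in descending score order
    and drop any remaining box whose IoU with a kept box exceeds the threshold."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of the kept box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```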

Faster RCNN

 Faster RCNN splices the RPN and Fast RCNN together and shares the front-end feature extraction. To train the two networks, the author considers three schemes: alternating training of the RPN and Fast RCNN; joint training that ignores back-propagation through the coordinate refinement; and joint training that includes it. The first scheme is finally adopted, in four steps:

  1. Fine-tune the RPN from a pre-trained network.
  2. Train Fast RCNN from a pre-trained network (not the front end from step 1), using the candidate regions produced in step 1.
  3. Freeze the front end obtained in step 2 as the RPN's front end, and fine-tune the RPN-specific proposal layers.
  4. Keeping the front end from step 2 frozen as Fast RCNN's front end, fine-tune Fast RCNN's detection-specific layers using the candidate regions from step 3.

 Faster RCNN was proposed to speed things up by replacing selective search with the RPN; at the same time, the RPN's multi-scale box selection draws on the RoI projection idea from SPPNet, and ablation experiments confirm the effectiveness of the multi-scale design. In terms of results, with ZFNet as the backbone the RPN shows no big advantage over selective search, while with VGG as the backbone the improvement is large; arguably its main advantage is speed. (See the code in [5].)

References


  1. RCNN: a pioneering work introducing CNNs into object detection

  2. [Object detection] SPPNet paper explained (Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition)

  3. [Object detection] Fast R-CNN paper explained (Fast R-CNN)

  4. RPN: Region Proposal Networks

  5. simple-faster-rcnn-pytorch

Copyright notice: this article was written by Midnight rain; when reposting, please include a link to the original:
https://yzsam.com/2022/175/202206240549480475.html