
Introduction to RCNN, Fast RCNN and Faster RCNN

2022-06-24 08:25:00 Midnight rain

RCNN, Fast RCNN, Faster RCNN

RCNN

 RCNN was the first algorithm to introduce ConvNets into the field of object detection. Unlike image classification, object detection must not only classify an image but also localize the objects in it. More formally: given an input image, a qualified detector should box every valid target (of the categories seen during training) and classify each one correctly.
 As a detector, RCNN must therefore complete both tasks, box selection and classification. The former is handled by selective search plus a neural-network position refinement; the latter by ConvNet feature extraction followed by SVMs. Training proceeds as follows:

  1. Candidate regions are first obtained from the input image with selective search, a hierarchical segmentation method from traditional image processing: it generates a set of initial regions and then merges them by cues such as color and texture to obtain a final region segmentation of the image.
  2. Network training. Because labeled detection datasets are small, the author uses transfer learning: AlexNet is first pre-trained for image classification on ILSVRC2012, then its final classification layer is removed for fine-tuning. Fine-tuning is a multi-class task: the input is a candidate region from step 1, and the output is whether that region is some object class or background (IoU > 0.5 with a ground-truth box counts as that object). Since candidate regions differ in size, they must be warped to a common size; the author compared direct resizing, padding then cropping, and cropping then padding [1], and the first works best.
  3. After fine-tuning, the final classification layer is removed again, and the output of the last FC layer serves as the image representation extracted by the network. For each category, the CNN features of candidate regions are used to train a class-specific SVM that performs the final classification. Note that the sample definitions here differ from those used for fine-tuning: negatives require IoU < 0.3, and positives must contain the complete object. (A separate SVM is needed because fine-tuning uses a loose positive definition to get enough training samples, while classification keeps the strict definition, so extra SVMs must be trained.)
  4. Because the regions from selective search seldom match the ground-truth boxes exactly, the author trains a class-specific regression model to fine-tune them: it takes the AlexNet pool5 features of same-class samples as input and outputs translation and scaling coefficients for the candidate box, yielding a refined region.
    The overall pipeline is as follows:
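The box refinement in step 4 can be sketched as a transform/refine pair. This is a minimal illustration, with boxes given as (center x, center y, width, height) and function names of my own choosing, not the paper's:

```python
import numpy as np

def bbox_transform(proposal, gt):
    """Regression targets (t_x, t_y, t_w, t_h) for a proposal P and
    ground-truth box G, both given as (center x, center y, w, h)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)])

def bbox_refine(proposal, t):
    """Apply predicted coefficients: translate the center, scale the size."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return np.array([px + tx * pw, py + ty * ph,
                     pw * np.exp(tw), ph * np.exp(th)])
```

Refining a proposal with its own targets recovers the ground-truth box exactly; the per-class regressor is trained to predict these targets from pool5 features.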

Fast RCNN

  RCNN clearly has problems, for example that training and testing are very time-consuming: a feature map must be computed once for every candidate region in an image, and an image typically yields about 2k candidates, which makes the computation heavy.

SPPNet

  To solve this problem, Kaiming He et al. proposed SPPNet. SPPNet takes the whole image as input and computes a single feature map; each candidate region of the original image corresponds to a part of that map and can be located by coordinate mapping. SPPNet compares with RCNN as follows:

The remaining problem is the varying sizes of the candidate regions. RCNN handles this by warping each region; SPPNet solves it with spatial pyramid pooling. The fixed-input-size constraint of a CNN comes not from the convolution layers, which treat any input alike, but from the fully connected layers, whose input dimension must match the layer weights. SPPNet therefore replaces the last pooling layer of the original network with a spatial pyramid pooling layer, which converts a feature map of any size into a fixed-length feature vector [2]. Feature-map crops of differently sized candidate regions are thus mapped to vectors of the same size for the subsequent decision. In other respects, SPPNet does not improve on RCNN.
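A spatial pyramid pooling layer can be sketched with PyTorch's adaptive pooling. This is an illustrative stand-in for the paper's layer, assuming max pooling over 4×4, 2×2 and 1×1 grids:

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feat, levels=(4, 2, 1)):
    """Pool a (C, H, W) feature map over several fixed grids and concatenate.
    The output length C * sum(n*n) is independent of H and W."""
    c = feat.shape[0]
    parts = [F.adaptive_max_pool2d(feat.unsqueeze(0), n).reshape(c * n * n)
             for n in levels]
    return torch.cat(parts)
```

Feature maps of any spatial size map to the same 21C-length vector, which is what lets the FC layers accept images of arbitrary size.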

Fast RCNN

  Besides, the RCNN pipeline is split into three separately trained parts: the ConvNet feature extractor, the SVM region classifiers, and the bounding-box regressor that refines candidate regions. The whole is disjoint and cannot be trained end to end. Fast RCNN emerged to fix this: building on SPPNet, it integrates the three components and jointly learns feature extraction, region classification, and candidate-region refinement. Its overall framework is as follows:

Like SPPNet, its input is a single image together with the candidate regions on it. RoI projection maps each candidate region onto the corresponding area of the feature map, i.e., the feature-map crop of that region. Crops of different sizes then pass through an RoI pooling layer to produce fixed-length feature vectors; the RoI pooling layer is a simplified spatial pyramid pooling layer that, instead of grids at several scales, uses a single grid (e.g. 7×7). The feature vectors then go through two fully connected heads, one for region classification and one for refinement-coefficient regression. For each input RoI, the training loss is:
$$L(p,u,t^u,v)=L_{cls}(p,u)+\lambda[u\geq 1]L_{loc}(t^u,v)$$
where $u$ is the region's true class, $p$ is the probability distribution output by the classification head, $t^u$ is the translation/scaling coefficients output by the regression head for class $u$, and $v$ is their ground-truth values. $L_{cls}$ is the cross-entropy loss and $L_{loc}$ the smooth L1 loss.
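The loss above can be sketched in PyTorch. This is a minimal illustration assuming class-specific box regression (the regression head predicts 4 coefficients per class and only the true class's set is penalized); the function name is my own:

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(cls_scores, bbox_pred, labels, bbox_targets, lam=1.0):
    """L = L_cls(p, u) + lambda * [u >= 1] * L_loc(t^u, v), averaged over RoIs."""
    loss_cls = F.cross_entropy(cls_scores, labels)    # L_cls: cross entropy
    fg = labels >= 1                                  # [u >= 1]: skip background
    n_fg = int(fg.sum())
    if n_fg > 0:
        # pick the 4 coefficients predicted for each foreground RoI's true class
        per_class = bbox_pred[fg].reshape(n_fg, -1, 4)
        pred = per_class[torch.arange(n_fg), labels[fg]]
        loss_loc = F.smooth_l1_loss(pred, bbox_targets[fg])  # L_loc: smooth L1
    else:
        loss_loc = cls_scores.new_zeros(())
    return loss_cls + lam * loss_loc
```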
  In addition, the author also uses some other trick, The author thinks that SPPnet Do not update the... Before the space pyramid pool layer Conv Layer weight , But only for the subsequent full connection layer finetune The reason is that the input of the same batch contains... From many different pictures ROI, and ROI It may contain a large receptive field of the original picture , Every time it is calculated in the back propagation ROI These areas need to be calculated , Resulting in a large amount of calculation . So when the author trains , The same batch contains only 2 A picture , Take... From each picture 128 individual ROI, In the forward and backward propagation of the same picture ROI Resources can be shared , It also speeds up the calculation . most important of all , The author proved through experiments that CONV It is necessary to update the parameters of the layer Of , Especially for those with large parameters ConvNet, Can be better improved MAP, But not all CONV It is necessary to train at all levels , Compared with the previous layers, fine adjustment of parameters is of little significance .
  To further speed up computation, the author applies truncated SVD to the fully connected layers: the original weight matrix $W$ is decomposed into the two factors $U$ and $\Sigma_t V^T$ [3]. With input size $v$ and output size $u$, the multiplication cost drops from $uv$ to $t(u+v)$, which is a saving when $t \ll \min(u,v)$.
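The factorization can be sketched in NumPy: the $u\times v$ weight is replaced by two thin matrices, so one matrix-vector product of cost $uv$ becomes two of total cost $t(u+v)$ (the function name is illustrative):

```python
import numpy as np

def truncated_svd_fc(W, t):
    """Factor a (u, v) FC weight as A @ B with A: (u, t), B: (t, v),
    keeping the t largest singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :t], s[:t, None] * Vt[:t]
```

In a network this amounts to splitting one FC layer into two smaller ones: v→t without bias, then t→u carrying the original bias.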
  Overall, Fast RCNN's main contributions are being fast and accurate: it realizes end-to-end training of the detection network, speeds up computation with truncated SVD, and demonstrates that updating the conv-layer parameters is effective.

Faster RCNN

 Fast RCNN integrates every step after proposal extraction, but the initial candidate regions still come from selective search, which is very time-consuming and a major obstacle to real-time detection. The authors therefore propose the Region Proposal Network (RPN), which uses a convolutional network to select candidate regions, and share the weights of the RPN's front end with the Fast RCNN detection network to further speed up the test phase. Thus Faster RCNN was born.

RPN

 The RPN computes initial candidate regions with a deep convolutional network. Its input is an image; its outputs are the coordinates of many rectangular regions on the image and, for each, the probability that it contains a valid object. More specifically, the RPN consists of a front-end feature extractor (shared with Fast RCNN) and an RPN-specific proposal part whose input is the image's feature map. Its structure is as follows:

  First, the RPN assumes that each position on the input feature map corresponds to $k$ candidate regions: by inverting the downsampling, a point on the feature map maps back to a location in the original image, around which rectangles can be drawn. To capture multiple scales, the author uses boxes of different sizes and different aspect ratios, so that even when an object is deformed or rescaled in the image, some candidate region still frames it well. These $k$ candidate regions are called anchors and are specified by a scale parameter and an aspect-ratio parameter: the former sets the box size in pixels, the latter its width-to-height ratio, and together with the center coordinates these four numbers determine a rectangle. The figure below shows candidate regions of different scales selected in the original image for three feature-map points [4].
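Anchor generation at a single feature-map position can be sketched as follows, assuming the paper's 3 scales × 3 aspect ratios with equal-area boxes per scale (the helper name and exact parameterization are illustrative):

```python
import numpy as np

def make_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """k = len(scales) * len(ratios) anchors centered at the origin,
    returned as (x1, y1, x2, y2); ratio r is height / width."""
    anchors = []
    for s in scales:
        area = float(base * s) ** 2   # all ratios at one scale share this area
        for r in ratios:
            w = np.sqrt(area / r)
            h = w * r
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)
```

Shifting these $k$ boxes to every feature-map position (scaled by the feature stride) yields the full anchor grid over the image.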

  Obtaining the initial candidate regions on the feature map is not enough; we must also judge whether these boxes are reasonable and how far they deviate. The author therefore applies a 3×3 convolution to the input feature map to obtain feature vectors, where each element can be regarded as a feature representation of the area around one box center. The features then feed two sibling 1×1 convolutions: a classification head with $2k$ output channels and a regression head with $4k$ output channels. At each position, the classification head outputs, for each of the $k$ anchors centered there, the probability that it does or does not contain a valid object (two scores per anchor, normalized by a softmax), and the regression head outputs the 4 position-adjustment parameters for each of the $k$ boxes.
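The two-headed structure can be sketched as a small PyTorch module; this is a hedged illustration, not the reference implementation:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """3x3 conv over the shared feature map, then two sibling 1x1 convs:
    an objectness head with 2k channels and a box head with 4k channels."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, 2 * k, 1)  # object / not-object scores
        self.reg = nn.Conv2d(in_channels, 4 * k, 1)  # 4 box deltas per anchor

    def forward(self, x):
        h = torch.relu(self.conv(x))
        return self.cls(h), self.reg(h)
```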
  After the above steps, we have, for each feature-map position, $k$ candidate regions and the probability that each contains a target object (object vs. no object only; finer classification is left to the detection network). The author then filters these candidates further: cross-boundary boxes are removed during training (at test time they are clipped rather than deleted), and NMS removes overlapping boxes, which also speeds up the network's convergence. The RPN loss is as follows:
$$L(\{p_i\},\{t_i\})=\frac{1}{N_{cls}}\sum_i L_{cls}(p_i,p_i^*)+\lambda\frac{1}{N_{reg}}\sum_i p_i^* L_{reg}(t_i,t_i^*)$$
The first term is the cross-entropy loss of the classification head; the second is the regression head's smooth L1 loss, the same loss used for candidate-region refinement in Fast RCNN.
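The NMS step mentioned above can be sketched in plain NumPy (greedy suppression; in practice one would use a library routine such as torchvision's):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-maximum suppression: keep boxes in descending score order
    and drop any remaining box whose IoU with a kept box exceeds the threshold."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of the kept box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```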

Faster RCNN

 Faster RCNN splices the RPN and Fast RCNN together and shares the front-end feature extraction. To train the two networks, the author considers three schemes: alternating training of the RPN and Fast RCNN; joint training that ignores back-propagation through the coordinate refinement; and joint training that includes it. The first scheme is finally adopted, in four steps:

  1. Fine-tune the RPN from a pre-trained network.
  2. Train Fast RCNN from a pre-trained network (not the front end from step 1), using the candidate regions produced in step 1.
  3. Freeze the front end obtained in step 2 as the RPN's front end, and fine-tune the RPN-specific proposal layers.
  4. Keeping the front end from step 2 frozen as Fast RCNN's front end, fine-tune Fast RCNN's detection-specific layers using the candidate regions from step 3.

 Faster RCNN was proposed to speed things up by replacing selective search with the RPN; at the same time, the RPN's multi-scale box selection draws on the RoI projection idea from SPPNet, and ablation experiments confirm the effectiveness of the multi-scale design. In terms of results, with ZFNet as the backbone the RPN shows no big advantage over selective search, while with VGG as the backbone the improvement is large; arguably its main advantage is speed. (See the code in [5].)

References


  1. RCNN: a pioneering work introducing CNNs into object detection

  2. [Object detection] SPPNet paper explained (Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition)

  3. [Object detection] Fast R-CNN paper explained (Fast R-CNN)

  4. RPN: Region Proposal Networks

  5. simple-faster-rcnn-pytorch

Copyright notice: this article was written by Midnight rain; when reposting, please include a link to the original:
https://yzsam.com/2022/175/202206240549480475.html