A brief history from object detection to image segmentation

2022-07-25 18:43:00 【Xiaobai learns vision】

Translator | Little Han

Editor | Encore

[Pan Chuang AI reading guide] This article traces the brief history of development from object detection to image segmentation, which can help us better understand and explain our models.

Contents

The goal of this article
2014: R-CNN - An early application of CNNs to object detection
- Understanding R-CNN
- Improving the bounding boxes
2015: Fast R-CNN - A faster and simpler R-CNN
- Fast R-CNN insight 1: RoI (Region of Interest) Pooling
- Fast R-CNN insight 2: Combining all models into one network
2016: Faster R-CNN - Speeding up region proposals
- How region proposals are generated
2017: Mask R-CNN - Extending Faster R-CNN to pixel-level segmentation
- RoIAlign - Adjusting RoIPool to be more accurate
Code
- Faster R-CNN
- Mask R-CNN
Outlook

Ever since Alex Krizhevsky, Geoff Hinton, and Ilya Sutskever won the ImageNet challenge in 2012, convolutional neural networks (CNNs) have been the gold standard for image classification. In fact, CNNs have since improved to the point where they now outperform humans on the ImageNet challenge!

While these results are impressive, image classification is far simpler than the complexity and diversity of true human visual understanding.

In classification, there is generally an image with a single object as the focus, and the task is to say what that object is. But when we look at the world around us, we carry out far more complex tasks.

We see complicated scenes with multiple overlapping objects and different backgrounds, and we not only classify these different objects but also identify their boundaries, their differences, and their relationships to one another.

Can CNNs help us with such complex tasks? In other words, given a more complicated image, can we use CNNs to identify the different objects in the image and their boundaries? As Ross Girshick and his colleagues have shown over the past few years, the answer is yes.

The goal of this article

In this article, we will cover the intuition behind some of the main techniques used in object detection and segmentation, and see how they evolved from one implementation to the next. In particular, we will cover R-CNN (Regional CNN), the original application of CNNs to this problem, along with its descendants Fast R-CNN and Faster R-CNN. Finally, we will cover Mask R-CNN, a paper from Facebook Research that extends these object detection techniques to provide pixel-level segmentation. The papers referenced in this article are:

R-CNN:
https://arxiv.org/abs/1311.2524
Fast R-CNN:
https://arxiv.org/abs/1504.08083
Faster R-CNN:
https://arxiv.org/abs/1506.01497
Mask R-CNN:
https://arxiv.org/abs/1703.06870

2014: R-CNN - An early application of CNNs to object detection

Inspired by the research of Hinton's lab at the University of Toronto, a small team at UC Berkeley led by Professor Jitendra Malik asked themselves what today seems like an inevitable question:

To what extent do [Krizhevsky et al.'s results] generalize to object detection?

Object detection is the task of finding the different objects in an image and classifying them (illustrated by a figure in the original article). The team, made up of Ross Girshick, Jeff Donahue, and Trevor Darrell, found that this problem can be solved with Krizhevsky's results by testing on the PASCAL VOC Challenge, an object detection challenge similar to ImageNet. They wrote:

This paper is the first to show that a CNN can achieve higher object detection performance on PASCAL VOC than systems based on simpler HOG-like features.

Let's now look at how their architecture, Regions With CNNs (R-CNN), works.

Understanding R-CNN

The goal of R-CNN is to take in an image and correctly identify where the main objects in the image are, expressed as bounding boxes.

Input: an image
Output: a bounding box and a label for each object in the image.

But how do we find out where these bounding boxes are? R-CNN does what we might intuitively do as well: first propose a large number of boxes in the image, then check whether each of them actually corresponds to an object.

R-CNN creates these bounding boxes, or region proposals, using a method called Selective Search. At a high level, Selective Search (illustrated by a figure in the original article) looks at the image through windows of different sizes and, for each size, tries to group adjacent pixels by texture, color, or intensity to identify objects.

Once the proposals are created, R-CNN warps each region to a standard square size and passes it through a modified version of AlexNet (the winning submission to ImageNet 2012).

On the final layer of the CNN, R-CNN adds a support vector machine (SVM) that simply classifies whether the region contains an object and, if so, which object. This is step 4 in the pipeline figure of the original article.

Improving the bounding boxes

Now that we have found the object in the box, can we tighten the box to fit the true dimensions of the object? We can, and this is the final step of R-CNN. R-CNN runs a simple linear regression on the region proposal to generate tighter bounding box coordinates and obtain the final result. The inputs and outputs of this regression model are:

Input: the sub-region of the image corresponding to the object
Output: new bounding box coordinates for the object in the sub-region
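
To make the regression output concrete, here is a minimal sketch of how four regression numbers could tighten a proposal. It assumes the centre-and-size parameterization described in the R-CNN paper (centre shifts scaled by box size, width and height scaled in log space). The function name apply_box_deltas and the sample numbers are illustrative and do not come from the original code.

```python
import math

def apply_box_deltas(box, deltas):
    """Refine a proposal (x1, y1, x2, y2) with regression outputs (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the centre in proportion to the box size; scale width/height in log space.
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)

proposal = (10.0, 20.0, 110.0, 220.0)
# All-zero deltas leave the proposal unchanged.
print(apply_box_deltas(proposal, (0.0, 0.0, 0.0, 0.0)))
# Shift right by 10% of the width and halve the width: roughly (45, 20, 95, 220).
print(apply_box_deltas(proposal, (0.1, 0.0, math.log(0.5), 0.0)))
```

Predicting relative offsets rather than absolute coordinates keeps the regression targets on a similar scale for small and large boxes.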

To summarize, R-CNN consists of the following steps:

Generate a set of region proposals (candidate bounding boxes).
Run the image inside each box through a pre-trained AlexNet, and finally through an SVM, to determine which object the box contains.
If the box contains an object, run it through a linear regression model to output tighter bounding box coordinates.
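
The three steps above can be sketched as a single loop. This is only a structural outline: the four callables passed in (propose_regions, cnn_features, classify, refine_box) are hypothetical stand-ins for Selective Search, AlexNet, the SVM, and the regressor, and the toy versions below exist only so the sketch runs.

```python
def rcnn_detect(image, propose_regions, cnn_features, classify, refine_box):
    """Run the three R-CNN stages on one image.

    All four callables are hypothetical stand-ins supplied by the caller:
      propose_regions(image)     -> list of (x1, y1, x2, y2) boxes
      cnn_features(image, box)   -> features of the warped crop
      classify(features)         -> (label or None, score)
      refine_box(features, box)  -> tighter box
    """
    detections = []
    for box in propose_regions(image):        # step 1: candidate boxes
        features = cnn_features(image, box)   # step 2: one CNN pass per box
        label, score = classify(features)     # step 2: SVM-style decision
        if label is None:                     # background, nothing to report
            continue
        detections.append((label, score, refine_box(features, box)))  # step 3
    return detections

# Toy stand-ins so the sketch runs end to end.
calls = []

def toy_features(image, box):
    calls.append(box)  # record every CNN pass
    return box

detections = rcnn_detect(
    image=None,
    propose_regions=lambda img: [(0, 0, 10, 10), (5, 5, 20, 20)],
    cnn_features=toy_features,
    classify=lambda f: ("cat", 0.9) if f[0] == 5 else (None, 0.0),
    refine_box=lambda f, box: box,
)
print(detections)  # [('cat', 0.9, (5, 5, 20, 20))]
print(len(calls))  # 2: one CNN pass per proposal
```

The call counter makes the cost visible: the CNN runs once for every proposal, so the number of passes grows with the number of proposals.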

2015: Fast R-CNN - A faster and simpler R-CNN

R-CNN works very well, but it is quite slow, for the following reasons:

It requires a forward pass of the CNN (AlexNet) for every region proposal of every image (around 2000 per image!).
It has to train three different models separately: the CNN that generates image features, the classifier that predicts the object class, and the regression model that tightens the bounding boxes. This makes the pipeline very hard to train.

In 2015, Ross Girshick, the first author of R-CNN, solved both of these problems, leading to the second algorithm in our short history: Fast R-CNN. Let's go over its main insights.

Fast R-CNN insight 1: RoI (Region of Interest) Pooling

Girshick realized that the region proposals for each image invariably overlap, so the same CNN computation is run again and again (about 2000 times). His insight was simple: why not run the CNN just once per image and then find a way to share that computation across the roughly 2000 proposals?

This is exactly what Fast R-CNN does using a technique known as RoIPool (Region of Interest Pooling). At its core, RoIPool shares the CNN's forward pass across the region proposals of an image. The CNN features for each region are obtained by selecting the corresponding region from the CNN's feature map (illustrated by a figure in the original article), and the features in each region are then pooled (usually with max pooling). So all it takes is one pass over the original image instead of the previous 2000 or so!
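
A minimal sketch of the pooling idea, written with plain Python lists rather than a deep learning library. The function roi_max_pool, the 6 x 6 toy feature map, and the 2 x 2 output size are all assumptions made for illustration; real implementations work on multi-channel tensors and first map the box from image coordinates to feature-map coordinates.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one region of a 2-D feature map into an out_size x out_size grid.

    feature_map: list of rows (H x W).
    roi: (row0, col0, row1, col1), end-exclusive, in feature-map coordinates.
    """
    r0, c0, r1, c1 = roi
    height, width = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_size):
        # Integer bin edges; every bin covers at least one cell.
        rs = r0 + (i * height) // out_size
        re = max(rs + 1, r0 + ((i + 1) * height) // out_size)
        row = []
        for j in range(out_size):
            cs = c0 + (j * width) // out_size
            ce = max(cs + 1, c0 + ((j + 1) * width) // out_size)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled

# The feature map is computed once per image ...
feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]
# ... and every region proposal only reads its own window from it.
print(roi_max_pool(feature_map, (0, 0, 4, 4)))  # [[7, 9], [19, 21]]
print(roi_max_pool(feature_map, (2, 1, 6, 6)))  # [[20, 23], [32, 35]]
```

Both regions produce a grid of the same size even though their shapes differ, which is what lets regions of any size feed the same downstream layers.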

Fast R-CNN insight 2: Combining all models into one network

The second insight of Fast R-CNN is to jointly train the CNN, the classifier, and the bounding box regressor in a single model. Where R-CNN had three different models to extract image features (CNN), classify (SVM), and tighten bounding boxes (regressor), Fast R-CNN uses a single network to compute all three.

The original article shows how this is done in a figure. Fast R-CNN replaces the SVM classifier with a softmax layer on top of the CNN to output a classification. It also adds a linear regression layer, parallel to the softmax layer, to output bounding box coordinates. In this way, all the outputs needed come from one single network! The inputs and outputs of the overall model are:

Input: images with region proposals
Output: an object classification for each region, along with a tighter bounding box.
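
The two parallel outputs can be sketched as two small linear layers that read the same pooled features. This is a simplified illustration, not the paper's network: the weights are made-up numbers, the feature vector has only two entries, and the box head here is class-agnostic (one set of four numbers) for brevity.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def detection_heads(roi_features, cls_w, cls_b, box_w, box_b):
    """Two parallel heads on the same pooled RoI features."""
    class_probs = softmax(linear(cls_w, cls_b, roi_features))  # classification
    box_deltas = linear(box_w, box_b, roi_features)            # box regression
    return class_probs, box_deltas

# Illustrative numbers only.
features = [1.0, 2.0]
cls_w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]  # 3 classes, e.g. background + 2 objects
cls_b = [0.0, 0.0, 0.0]
box_w = [[0.1, 0.0], [0.0, 0.1], [0.05, 0.0], [0.0, 0.05]]
box_b = [0.0, 0.0, 0.0, 0.0]

probs, deltas = detection_heads(features, cls_w, cls_b, box_w, box_b)
print(probs)   # three probabilities that sum to 1
print(deltas)  # four box-refinement numbers
```

Because both heads share their input, a single backward pass can update the feature extractor for classification and localization at once, which is the point of merging the three models.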

2016: Faster R-CNN - Speeding up region proposals

Even with all these advances, one bottleneck remained in the Fast R-CNN process: the region proposer. As we saw, the very first step in detecting the locations of objects is generating a set of potential bounding boxes, or regions of interest, to test. In Fast R-CNN these proposals were created with Selective Search, a fairly slow process that turned out to be the bottleneck of the overall pipeline.

In mid-2015, a Microsoft Research team made up of Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun found a way to make the region proposal step almost cost-free, with an architecture they named Faster R-CNN.

The insight of Faster R-CNN is that region proposals depend on features of the image that have already been calculated by the forward pass of the CNN (the first step of classification). So why not reuse those same CNN results for region proposals instead of running a separate Selective Search algorithm?

Indeed, this is exactly what the Faster R-CNN team achieved. A single CNN is used to produce both the region proposals and the classifications (illustrated by a figure in the original article). This way, one CNN computation is enough to obtain the region proposals as well! The authors write:

Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals (at almost no cost).

The inputs and outputs of the model are:

Input: images (region proposals are not required).
Output: classifications and bounding box coordinates of the objects in the images.

How region proposals are generated

Let's look at how Faster R-CNN generates these region proposals from CNN features. Faster R-CNN adds a fully convolutional network on top of the CNN's features, creating what is known as the Region Proposal Network (RPN).

The Region Proposal Network works by sliding a window over the CNN feature map and, at each window position, outputting k potential bounding boxes together with a score for how good each box is expected to be. What do these k boxes represent?

Intuitively, objects in an image should fit certain common aspect ratios and sizes. For instance, we want some rectangular boxes that resemble the shape of a person, and we know we won't see many boxes that are extremely thin. So we create k boxes with such common aspect ratios, called anchor boxes. For each anchor box, the network outputs one bounding box and one score per position in the image.
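
As a rough illustration of what the k anchor boxes at one position look like, the sketch below builds them from a list of scales and aspect ratios. The defaults (three scales and three ratios, giving k = 9) follow the settings reported in the Faster R-CNN paper. The function name make_anchors and the convention that a ratio means height divided by width are assumptions made for this example.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred on (cx, cy).

    ratio is height / width; each anchor keeps an area of scale ** 2.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(cx=300, cy=200)
print(len(anchors))  # 9
print(anchors[1])    # (236.0, 136.0, 364.0, 264.0): the 128 x 128 square anchor
```

The network then only has to predict small corrections to these reference shapes, which is easier than predicting a box from scratch at every position.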

The inputs and outputs of the Region Proposal Network are:

Input: the CNN feature map.
Output: a bounding box per anchor, and a score representing how likely the image inside that box is to be an object.

We then pass each bounding box that is likely to contain an object into Fast R-CNN to generate a classification and a tightened bounding box.

2017: Mask R-CNN - Extending Faster R-CNN to pixel-level segmentation

So far, we have seen CNN features used in many ways to effectively locate different objects in an image with bounding boxes.

Can we extend such techniques one step further and locate the exact pixels of each object instead of just a bounding box? This problem, known as image segmentation, is what Kaiming He and a team of researchers at Facebook AI, including Girshick, explored with an architecture called Mask R-CNN.

Much like Fast R-CNN and Faster R-CNN, the intuition behind Mask R-CNN is straightforward: given that Faster R-CNN works so well for object detection, could we extend it to carry out pixel-level segmentation as well?

Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask indicating whether or not a given pixel is part of an object. As before, this branch (shown in white in the original article's figure) is simply a fully convolutional network on top of the CNN feature map. Its inputs and outputs are:

Input: the CNN feature map.
Output: a matrix with 1s at every location where the pixel belongs to the object and 0s elsewhere (known as a binary mask).
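
A binary mask of this kind can be illustrated with a small sketch. It assumes the branch produces a per-pixel probability of belonging to the object and that a threshold of 0.5 is used to binarize it. The function name to_binary_mask and the sample values are made up for the example.

```python
def to_binary_mask(probabilities, threshold=0.5):
    """Mark each pixel 1 if it is predicted to belong to the object, else 0."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]

# Hypothetical per-pixel probabilities for a 2 x 3 region.
probs = [[0.1, 0.8, 0.9],
         [0.2, 0.7, 0.4]]
mask = to_binary_mask(probs)
print(mask)                             # [[0, 1, 1], [0, 1, 0]]
print(sum(sum(row) for row in mask))    # 3 pixels belong to the object
```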

RoIAlign - Adjusting RoIPool to be more accurate

When Mask R-CNN was run on the original Faster R-CNN architecture without modification, its authors realized that the regions of the feature map selected by RoIPool were slightly misaligned with the regions of the original image. Because image segmentation requires pixel-level precision, unlike bounding boxes, this naturally led to inaccuracies.

The authors cleverly solved this problem by adjusting RoIPool so that it is aligned more precisely, using a method known as RoIAlign.

Imagine we have an image of size 128 x 128 and a feature map of size 25 x 25. Suppose we want the features for the region corresponding to the top-left 15 x 15 pixels of the original image (illustrated by a figure in the original article). How do we select these pixels from the feature map?

Each pixel in the original image corresponds to about 25/128 pixels in the feature map. To select 15 pixels from the original image, we therefore select 15 * 25 / 128 ≈ 2.93 pixels from the feature map.

In RoIPool, we would round this down and select only 2 pixels, causing a slight misalignment. In RoIAlign, we avoid such rounding. Instead, we use bilinear interpolation to get a precise idea of what lies at pixel 2.93. At a high level, this is what lets us avoid the misalignments caused by RoIPool.
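
The arithmetic above, and the interpolation step, can be checked with a short sketch. The 128, 25, and 15 values come from the example in the text. The function bilinear_sample and the toy feature map (where each cell in the first row simply stores its own column index) are assumptions chosen so the interpolated value is easy to read.

```python
import math

def bilinear_sample(feature_map, y, x):
    """Interpolate a 2-D feature map (list of rows) at a fractional position (y, x)."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(feature_map) - 1)
    x1 = min(x0 + 1, len(feature_map[0]) - 1)
    dy, dx = y - y0, x - x0
    # Blend the four surrounding cells by their distance to (y, x).
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

image_size, feature_size, region = 128, 25, 15
exact = region * feature_size / image_size
print(exact)              # 2.9296875
print(math.floor(exact))  # 2, which is what rounding down would keep

# Toy 25 x 25 feature map: cell (r, c) stores r * 25 + c.
feature_map = [[float(r * 25 + c) for c in range(25)] for r in range(25)]
# Reading the first row at the fractional edge blends columns 2 and 3.
print(bilinear_sample(feature_map, 0.0, exact))  # 2.9296875
```

Rounding down discards almost a full feature-map cell here (0.93 of one), which is a large error relative to a region that is less than 3 cells wide.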

Once these masks are generated, Mask R-CNN combines them with the classifications and bounding boxes from Faster R-CNN to produce precise segmentations.

Code

If you are interested in trying out these algorithms yourself, here are the relevant repositories:

Faster R-CNN

Caffe: https://github.com/rbgirshick/py-faster-rcnn
PyTorch: https://github.com/longcw/faster_rcnn_pytorch
MATLAB: https://github.com/ShaoqingRen/faster_rcnn

Mask R-CNN

PyTorch: https://github.com/felixgwu/mask_rcnn_pytorch
TensorFlow: https://github.com/CharlesShang/FastMaskRCNN

Outlook

In just three years, we have seen how the research community progressed from Krizhevsky et al.'s original result to R-CNN, and finally all the way to such powerful results as Mask R-CNN. Seen in isolation, results like Mask R-CNN look like incredible leaps of genius that would be unapproachable. Yet through this article, I hope you have seen that these advances were achieved gradually, through years of hard work and collaboration. Each of the ideas proposed by R-CNN, Fast R-CNN, Faster R-CNN, and finally Mask R-CNN was not necessarily a qualitative leap, yet together they have produced remarkable results that bring us closer to a human level of vision.

What particularly excites me is that the time between R-CNN and Mask R-CNN was just three years! With continued attention and support, how much further can computer vision improve over the next three years?

Source: https://blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-34ea83205de4
