当前位置:网站首页>A brief history from object detection to image segmentation
A brief history from object detection to image segmentation
2022-07-25 18:43:00 【Xiaobai learns vision】
Click on the above “ Xiaobai studies vision ”, Optional plus " Star standard " or “ Roof placement ”
Heavy dry goods , First time delivery translator | Little Han
edit | Encore
【 Pan Chuang AI Reading guide 】: This article explains the brief development history from target detection to image segmentation . This can help us better understand or explain our model . Want to get more machine learning 、 Deep learning resources , Welcome to click on the blue words above to follow our official account : Pan Chuang AI.
Catalog
The goal of this article
2014: R-CNN CNNs Early application in target detection
understand R-CNN
Improve borders
2015: Fast R-CNN - Faster and simpler R-CNN
Fast R-CNN The first thought :RoI (Region of Interest) Pooling
Fast R-CNN The second thought : Combine all models into a network
2016: Faster R-CNN - Accelerate candidate areas
How are candidate regions generated
2017: Mask R-CNN - Faster R-CNN Extend to pixel level segmentation
RoiAlign - adjustment RoIPool Make more accurate
Code
Faster R-CNN
Mask R-CNN
expectation
since Alex Krizhevsky、Geoff Hinton and Ilya Sutskever stay 2012 Won in ImageNet Since the challenge , Convolutional neural networks (CNN) It has become the gold standard of image classification . In fact, since then ,CNN It has been improved to now ImageNet The degree of surpassing human beings in the challenge !

Although these results are impressive , But image classification is much simpler than the complexity and diversity of human visual understanding .

In classification , There is usually an image , There is a single target as the focus , The task is to explain what the goal is . But when we observe the world around us, we perform more complex tasks .

The scene we see has many overlapping goals and different backgrounds , We should not only classify these different goals , Also determine the boundaries between them , Differences and relationships .

CNN Can you help us complete such a complex task ? in other words , Given a more complex image , We can use CNN To recognize different targets and their boundaries in the image ? just as Ross Girshick And his colleagues have shown in the past few years , The answer is yes .
The goal of this article
Through this article , We will introduce the principles behind some of the main technologies used in object detection and segmentation , And understand how they evolve from one implementation to the next . Special , We will introduce R-CNN(Regional CNN),CNNs The original application of , And its descendants Fast R-CNN and Faster R-CNN. Last , We will introduce Facebook Research An article published Mask R-CNN, This paper extends this object detection technology to provide pixel level segmentation . The following are the papers cited in this article :
R-CNN:
https://arxiv.org/abs/1311.2524
Fast R-CNN:
https://arxiv.org/abs/1504.08083
Faster R-CNN:
https://arxiv.org/abs/1506.01497
Mask R-CNN:
https://arxiv.org/abs/1703.06870
2014: R-CNN CNNs Early application in target detection

University of Toronto Hinton Inspired by laboratory research , By the University of California, Berkeley Jitendra Malik The small team led by the professor began to explore an inevitable problem today :
[Krizhevsky Results of et al ] To what extent can it be extended to target detection ?
The task of target detection is to find different targets in the image and classify them ( As shown in the figure above ), from Ross Girshick ,Jeff Donahue and Trevor Darrel The team found ,Krizhevsky The result can solve this problem , And pass PASCAL VOC Challenge Test of , This is a kind of similar to ImageNet Target detection challenges . They wrote :
This paper shows for the first time that , And based on simple HOG Class function system ,CNN Can be in PASCAL VOC Achieve higher target detection performance on .
Now let's understand their architecture ,Regions WithCNNs(R-CNN) How it works .
understand R-CNN
R-CNN The goal of is to get images , And correctly identify the main targets in the image ( Use border (bounding box) Express ) The location of .
Input : Images
Output : The bounding box of each target in the image (bounding box) And labels (label).
But how do we find the location of these bounding boxes ? R-CNN Follow our intuition -- First, mark many bounding boxes in the image , Then judge whether each bounding box actually corresponds to a target .

R-CNN Use what is called selective search (Selective Search) Methods to create these bounding boxes or candidate areas . At a higher level , Selective search ( As shown in the figure above ) View images through windows of different sizes , And for each size , Try texture , Color or intensity combines adjacent pixels to identify the target .

Once you create some candidate areas ,R-CNN The area will become a standard square size , And pass it to the modified AlexNet(2012 year ImageNet Award submission ), As shown in the figure above .
stay CNN The last floor of ,R-CNN Add a support vector machine (SVM), It simply determines whether this is a goal , If so , What is the goal . See page 4 Step .
Improve borders
Now? , Found this target in the border , Can we reduce the bounding box to the actual size of the target ? The answer is yes , This is it. R-CNN The last step of .R-CNN Simple linear regression of candidate regions , Generate tighter bounding box coordinates to get the final result . The following are the inputs and outputs of this regression model :
Input : The sub region of the corresponding target of the image
Output : New target bounding box in sub region
To sum up ,R-CNN There are several steps :
Generate a series of candidate frames .
Input the image in the frame into the pre trained AlexNet, Finally through SVM Confirm what target is in the bounding box .
If there is a target in the image , Input the image in the frame into the linear regression model , Output tighter bounding box coordinates .
2015: Fast R-CNN - Faster and simpler R-CNN

R-CNN It runs very well , But it's slower , There are the following reasons :
Each candidate area of each image should be input into CNN(AlexNet) in ( Each image has about 2000 individual !).
Three different models need to be trained separately : Generating image features CNN , The classifier that predicts the target category , Generate a regression model with a tighter bounding box . This makes the model very difficult to train .
stay 2015 year ,R-CNN The first author of Ross Girshick These two problems have been solved , A second algorithm with only a short history is proposed - Fast R-CNN. Review its main ideas .
Fast R-CNN The first thought :RoI (Region of Interest) Pooling
Girshick To realize CNN There are many repeated candidate areas in each picture , So there are many repetitions CNN Calculation ( about 2000 Time ). His idea is very simple — Why not make each picture only once CNN Calculate and find a way to make this about 2000 Candidate regions share the calculation results ?

That's exactly what it is. Fast R-CNN The used is called RoIPool(Region of Interest Pooling) The things that were done . RoIPool The core of is to let candidate regions share CNN Result . In the diagram above , For each area CNN The characteristics are all through CNN The feature map is obtained by selecting the corresponding region . Then each area is pooled ( It is usually maximum pooling ). So we only need to calculate the original image once instead of the previous 2000 Time !
Fast R-CNN The second thought : Combine all models into a network

Fast R-CNN The second idea is to CNN, Classifiers and bounding box regressors are placed in one model . Compared with the previous three different models , Image features (CNN)、 classifier (SVM)、 Bounding box ( Return to ),Fast R-CNN Only one network computing is used .
You can see how it is done in the above figure . Fast R-CNN use SVM The classifier replaces the original CNN At the top of the softmax layer . It also adds a connection with softmax The parallel linear regression layer is used to output the boundary box coordinates . such , All the output required comes from a network ! The following is the input and output of the overall model :
Input : Images with candidate areas
Output : Target classification and tighter bounding box of each area .
2016: Faster R-CNN - Accelerate candidate areas
Even with these advances ,Fast R-CNN There is still a bottleneck in the process of — candidate region . As you've seen before , The first step to detect the target location is to generate many potential bounding boxes or regions of interest for testing . stay Fast R-CNN in , These areas are using selective search (Selective Search) Created , This is a rather slow process , Is the bottleneck of the whole process .

stay 2015 Mid term , from Shaoqing Ren,Kaiming He,Ross Girshick and Jian Sun The Microsoft research team has found a way , They call it Faster R-CNN The architecture of , It takes little extra time to generate candidate regions .
Faster R-CNN The idea is , The candidate area depends on having passed CNN Calculated image features ( The first step of classification ). Why not reuse when generating candidate areas CNN Calculate the results and run the selective search algorithm alone ?

actually , That's exactly what it is. Faster R-CNN What the team has achieved . In the diagram above , You can see a single CNN Calculate how to get candidate regions and classifications . such , Just one calculation CNN You can get candidate areas ! The author wrote :
Our observation is that , Area based detectors ( Such as Fast R-CNN) The convolution feature map used can also be used to generate candidate regions ( Almost no cost ).
The input and output of the model :
Input : Images ( Candidate areas are not required ).
Output : Classification and bounding box coordinates of objects in the image .
How are candidate regions generated
Let's see Faster R-CNN How to generate candidate regions . Faster R-CNN stay CNN A full convolution network is added to the feature of , That is, the candidate regional network (Region Proposal Network).

The candidate area network is CNN Slide a window on the feature map , Each window outputs k Three possible bounding boxes and predict the quality of each bounding box , Score a point . this k What does a bounding box represent ?

Intuitively , The target in the image should be suitable for some common aspect ratio and size . for example , If you want some rectangular boxes similar to human shapes, you won't see many very thin boxes . Create... In this way k A frame with such a common aspect ratio , be called anchor boxes. For each anchor boxes, Output a bounding box and score each position in the image .
Input and output of candidate area network :
Input : CNN Characteristics of figure .
Output : Every anchor A bounding box . A score of the probability of having a goal in the bounding box .
Then pass each bounding box that may have a target to Fast R-CNN, Generate categories and smaller bounding boxes .
2017: Mask R-CNN - Faster R-CNN Extend to pixel level segmentation

up to now , We've seen it used in many ways CNN To effectively locate different targets with bounding boxes in the image .
Can we further expand this technology to locate the pixels of each target rather than just a bounding box ? This problem is image segmentation , yes Kaiming He And include Girshick,Facebook AI The research team used Mask R-CNN structure .

image Fast R-CNN and Faster R-CNN equally ,Mask R-CNN Intuitively, it's very direct . Whereas Faster R-CNN The effect in target detection is very good , Can we extend it to pixel level segmentation ?

Mask R-CNN Through to the Faster R-CNN Add branches to complete this operation , This branch outputs a binary mask (binary mask), The mask indicates whether the pixel is part of the target . As mentioned earlier , Branch ( White in the above figure ) Based on CNN Full convolution network on the feature map of . Here are its inputs and outputs :
Input : CNN Characteristics of figure
Output : Pixels in all positions belonging to the target are 1 And in other places 0( Called binary mask ) Matrix .
RoiAlign - adjustment RoIPool Make more accurate

When running on the original without modification Faster R-CNN Upper time ,Mask R-CNN The author of realized that RoIPool The area of the selected feature map corresponds to the area of the original image slightly inaccurate . Unlike bounding boxes , Image segmentation requires pixel level characteristics , This naturally leads to inaccuracies .
The author skillfully adjusted RoIPool To solve this problem , Use is called RoIAlign Method makes alignment more accurate .

Imagine , We have a size of 128 * 128 And a size of 25 * 25 Characteristic graph . The feature area we want corresponds to the... In the upper left corner of the original image 15 * 15 The pixel ( See above ). How do we select these pixels from the feature map ?
Each pixel in the original image corresponds to ~25/128 Pixels . To select from the original image 15 Pixel , We only choose from the feature map 15 * 25 / 128~= 2.93 Pixel .
stay RoIPool in , We round down to select only 2 Pixel , Cause slight dislocation . however , stay RoIAlign in , We don't use rounding . contrary , We use bilinear interpolation to accurately restore 2.93 Pixels correspond to the content of the original image . This largely avoids RoIPool Dislocation caused .
Once these masks are generated ,Mask R-CNN Mask and Faster R-CNN The generated classification and bounding box are combined , Generate more accurate segmentation :

Code
If you are interested in learning about these algorithms , Here is the relevant code :
Faster R-CNN
Caffe: https://github.com/rbgirshick/py-faster-rcnn
PyTorch: https://github.com/longcw/faster_rcnn_pytorch
MatLab: https://github.com/ShaoqingRen/faster_rcnn
Mask R-CNN
PyTorch: https://github.com/felixgwu/mask_rcnn_pytorch
TensorFlow: https://github.com/CharlesShang/FastMaskRCNN
expectation
In a short span of 3 years , We've seen how the research community starts from Krizhevsky The original results of et al R-CNN, Finally, until Mask R-CNN Such powerful achievements . Look at... In isolation , image Mask R-CNN Such an achievement looks like an incredible leap of genius that cannot be achieved . However , Through this article , I hope you can see that these advances have been slowly achieved through years of efforts and collaboration . R-CNN,Fast R-CNN,Faster R-CNN And finally Mask R-CNN Every idea raised is not necessarily a qualitative leap , But their combination has produced very remarkable results , Closer to the level of human vision .
What makes me particularly excited is ,R-CNN and Mask R-CNN The time between them is only three years ! Through increasing attention and support , Whether computer vision can be further improved in the next three years ?
source :https://blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-34ea83205de4
The good news !
Xiaobai learns visual knowledge about the planet
Open to the outside world

download 1:OpenCV-Contrib Chinese version of extension module
stay 「 Xiaobai studies vision 」 Official account back office reply : Extension module Chinese course , You can download the first copy of the whole network OpenCV Extension module tutorial Chinese version , Cover expansion module installation 、SFM Algorithm 、 Stereo vision 、 Target tracking 、 Biological vision 、 Super resolution processing and other more than 20 chapters .
download 2:Python Visual combat project 52 speak
stay 「 Xiaobai studies vision 」 Official account back office reply :Python Visual combat project , You can download, including image segmentation 、 Mask detection 、 Lane line detection 、 Vehicle count 、 Add Eyeliner 、 License plate recognition 、 Character recognition 、 Emotional tests 、 Text content extraction 、 Face recognition, etc 31 A visual combat project , Help fast school computer vision .
download 3:OpenCV Actual project 20 speak
stay 「 Xiaobai studies vision 」 Official account back office reply :OpenCV Actual project 20 speak , You can download the 20 Based on OpenCV Realization 20 A real project , Realization OpenCV Learn advanced .
Communication group
Welcome to join the official account reader group to communicate with your colleagues , There are SLAM、 3 d visual 、 sensor 、 Autopilot 、 Computational photography 、 testing 、 Division 、 distinguish 、 Medical imaging 、GAN、 Wechat groups such as algorithm competition ( It will be subdivided gradually in the future ), Please scan the following micro signal clustering , remarks :” nickname + School / company + Research direction “, for example :” Zhang San + Shanghai Jiaotong University + Vision SLAM“. Please note... According to the format , Otherwise, it will not pass . After successful addition, they will be invited to relevant wechat groups according to the research direction . Please do not send ads in the group , Or you'll be invited out , Thanks for your understanding ~边栏推荐
- The auction house is a VC, and the first time it makes a move, it throws a Web3
- Project: serial port receiving RAM storage TFT display (complete design)
- 如何创建一个有效的帮助文档?
- Experimental reproduction of image classification (reasoning only) based on caffe resnet-50 network
- [noi2015] package manager
- [QNX hypervisor 2.2 user manual]9.5 dump
- [haoi2015] tree operation
- 解决You can change this value on the server by setting the ‘max_allowed_packet‘ variable报错
- [QNX hypervisor 2.2 user manual]9.4 dryrun
- 「跨链互连智能合约」解读
猜你喜欢

Pixel2Mesh从单个RGB图像生成三维网格ECCV2018

《21天精通TypeScript-4》-类型推断与语义检查

There was an error while marking a file for deletion

Circulaindicator component, which makes the indicator style more diversified

什么是hpaPaaS平台?

ESP32 S3 vscode+idf搭建

Northeast people know sexiness best

MySQL子查询篇(精选20道子查询练习题)

Software testing -- common testing tools

Powell's function of Ceres
随机推荐
How to create an effective help document?
This is a quick look-up table of machine & deep learning code
Process communication (Systemv communication mode: shared memory, message queue, semaphore)
Nc68 jumping steps
Add a little surprise to life and be a prototype designer of creative life -- sharing with X contestants in the programming challenge
There was an error while marking a file for deletion
Typescript对象代理器Proxy使用
Common file operations
Paper revision reply 1
可视化模型网络连接
Optimistic lock resolution
[translation] logstash, fluent, fluent bit, or vector? How to choose the right open source log collector
[QNX Hypervisor 2.2用户手册]9.5 dump
字符串函数和内存函数(二)
Typescript反射对象Reflect使用
进程通信(SystemV通信方式:共享内存,消息队列,信号量)
pd.melt() vs reshape2::melt()
Yyds dry inventory interview must brush top101: reverse linked list
曾拿2亿融资,昔日网红书店如今全国闭店,60家店仅剩3家
大厂云业务调整,新一轮战争转向