Neighbor-Vote: improving monocular 3D object detection through neighbor voting (ACM MM 2021)
2022-06-24 06:03:00 【3D vision workshop】
Title: Neighbor-Vote: Improving Monocular 3D Object Detection through Neighbor Distance Voting
Link: https://arxiv.org/pdf/2107.02493.pdf
Abstract: As cameras are used more and more widely in emerging applications such as autonomous driving, 3D object detection from monocular images has become an important task in visual scene understanding. Recent progress in monocular 3D object detection relies on pseudo point cloud generation: monocular depth estimation is used to lift 2D pixels to pseudo 3D points. However, depth estimated from a monocular image is not very accurate, which inevitably shifts the positions of the pseudo points within a target. As a result, the predicted boxes may be inaccurately located and distorted in shape. The paper presents a novel neighbor voting method in which neighbor predictions help improve object detection from severely deformed pseudo point clouds. Specifically, each feature point forms its own prediction, and the predictions are then combined by voting to build a "consensus". In this way, neighbor predictions can be effectively combined with local predictions to achieve more accurate 3D detection. To further enlarge the difference between foreground ROI pseudo points and background points, the paper also encodes the ROI prediction score of each 2D foreground pixel into the corresponding pseudo 3D point. The method is evaluated on the KITTI benchmark, where its bird's-eye-view (BEV) detection results on the validation set outperform the current state of the art (SOTA), especially at the "hard" difficulty level.
1. Introduction
3D object detection is one of the most important tasks in applications that depend on understanding the 3D world, such as autonomous driving. Many point-cloud-based 3D object detection algorithms already exist. Although these methods achieve excellent performance, LiDAR is still too expensive to be fitted to every car. Cheaper alternatives are therefore more popular, especially cameras, because of their low price and high frame rate.
On the other hand, because depth information is missing, 3D detection on RGB images, and on monocular images in particular, remains a daunting challenge. Existing methods address it by first estimating depth from the monocular image and then converting the 2D pixels into pseudo 3D points. A 3D object detector can then be applied to the resulting pseudo point cloud.
Compared with a real LiDAR point cloud, the pseudo point cloud described above has several problems. First, because monocular depth estimation is inevitably inaccurate, the pseudo point cloud suffers from position shifts and shape deformation, which can harm 3D box regression. Second, depth estimation is less accurate for distant targets than for nearby ones, so the distortion grows noticeably with distance. These distorted pseudo point clouds lead to a large number of false-positive boxes.
The paper presents a method called Neighbor-Vote. Specifically, every point around a target on the feature map is treated as a "voter". Each voter votes, from its own perspective, for a certain number of nearby targets. Through this voting process, false detections receive far fewer votes than real targets and are therefore easier to identify.
In summary, the paper makes the following three contributions:
- An efficient 3D detection network for monocular images. The network consists of four main steps: pseudo point cloud generation, 2D ROI score association, attention-based feature extraction, and neighbor-assisted prediction.
- A neighbor voting method that effectively eliminates false-positive boxes in predictions made from pseudo point clouds. Neighbor predictions and local predictions are combined adaptively, which greatly improves the accuracy of box prediction.
- Experiments showing that the method achieves the best performance on the KITTI BEV benchmark.
2. Neighbor voting system design
2.1 Overview
Figure 1: Overall framework of Neighbor-Vote
The paper proposes Neighbor-Vote, a framework based on pseudo point clouds that aims to improve monocular 3D object detection through additional predictions from neighboring features. As shown in Figure 1, Neighbor-Vote is a single-stage detector consisting of the following four main steps:
(1) Pseudo point cloud generation.
(2) 2D ROI score association.
(3) Feature extraction based on self-attention.
(4) Neighbor-voting-assisted target prediction.
Figure 1 shows the whole framework, and the four steps are discussed one by one below.
2.2 Pseudo point cloud generation
2.3 Foreground likelihood association for the pseudo point cloud
Depth estimation is much less accurate for distant targets than for nearby ones, so pseudo-LiDAR points far from the camera are shifted by a large amount. To compensate for the inaccurate depth estimation, this step tries to enlarge the difference between the foreground regions of interest (ROI) and the background, especially for distant objects. To this end, the paper associates the ROI score of each foreground 2D pixel with the corresponding pseudo-LiDAR point, and uses the score to indicate how likely the point is to be a foreground point.
The authors observe that, in a 2D image, a distant object is small and low-resolution but usually retains a certain amount of semantic information. In fact, for the car category of the KITTI dataset, many 2D detectors, such as FCOS, CenterNet and Cascade R-CNN, already reach an average precision (AP) above 75% on hard-level targets at an IoU threshold of 0.7. Based on this result, the paper proposes to use a 2D detector to extract ROI regions and to associate the predicted scores with the corresponding pseudo-LiDAR points.
The paper uses FCOS as the 2D detector. The score of each pixel inside a bounding box is projected into 3D space, and the score is then encoded as the fourth channel of the pseudo point cloud, alongside the three spatial coordinates.
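The two steps above (lifting pixels to pseudo 3D points, then attaching the 2D ROI score as a fourth channel) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: it assumes a standard pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, a dense depth map given as nested lists, 2D boxes given as `(u_min, v_min, u_max, v_max, score)` tuples, and a score of 0 for pixels outside every box. The function name and the data layout are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical box layout: (u_min, v_min, u_max, v_max, score), in pixel coordinates.
Box = Tuple[float, float, float, float, float]
# A pseudo point with the foreground score as fourth channel: (x, y, z, score).
Point = Tuple[float, float, float, float]


def pseudo_points_with_scores(
    depth: Sequence[Sequence[float]],
    boxes: Sequence[Box],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project pixels with valid depth and attach a 2D ROI score to each."""
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels without a valid depth estimate
            # Pinhole back-projection (assumed camera model).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            # Assumption: background pixels get score 0; overlapping boxes
            # contribute their highest score.
            score = 0.0
            for u0, v0, u1, v1, s in boxes:
                if u0 <= u <= u1 and v0 <= v <= v1:
                    score = max(score, s)
            points.append((x, y, z, score))
    return points
```

With this layout, a downstream 3D detector can treat the score exactly like the intensity channel of a real LiDAR point, which is one plausible reason for encoding it as a fourth channel.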
2.4 Self-attention feature extraction
Because the pseudo point cloud is severely displaced and deformed, identifying the position and shape of a target depends on the spatial context provided by the feature points around it, and capturing this context requires features that span relatively long distances. Stacking multiple convolutions with a fixed receptive field at each location cannot effectively extract features over a long enough range. The paper therefore incorporates a self-attention mechanism into the feature extraction module.
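To see why self-attention helps with long-range context, here is a minimal sketch of generic scaled dot-product self-attention over a flat list of feature vectors. It only illustrates the mechanism: the article does not describe the paper's actual attention module, and this sketch assumes a single head with no learned query, key or value projections.

```python
import math
from typing import List, Sequence


def self_attention(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Single-head scaled dot-product self-attention without learned projections."""
    dim = len(features[0])
    outputs: List[List[float]] = []
    for query in features:
        # Similarity of this position to every position, near or far.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        # Numerically stable softmax over all positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is a weighted sum over all positions.
        outputs.append(
            [sum(w * f[j] for w, f in zip(weights, features)) for j in range(dim)]
        )
    return outputs
```

Each output vector mixes information from every position in a single step, whereas a convolution with a fixed receptive field only sees a local window per layer.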
2.5 Box prediction combined with neighbor voting
Neighbor voting. As mentioned earlier, a pseudo point cloud describes the position and shape of a target less accurately than a real point cloud. To address this, the paper proposes to use the feature points near a target (called "neighbors") and let them help determine where the target is. Specifically, the paper takes the individual viewpoint of each neighbor point and tries to form a "consensus" through a voting mechanism. Consider a bird's-eye-view feature map whose dimensions are given by the feature-map sizes in the x and z directions, with ? denoting the down-sampling rate. Feature points close to a predicted target are treated as voting neighbors, or "voters". Each voter casts two votes: it votes for the two closest targets, one in front and one behind (by relative position in the ? direction).
Here P is the list of predicted targets, and the two selected targets are the nearest ones in front of and behind the voter. The paper first lets all feature points vote and then filters out the votes whose distance exceeds a certain threshold, so that all remaining voting neighbors are genuinely close to the predicted targets. The voting process is illustrated in Figure 2.
Figure 2: Illustration of the voting process
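The voting rule described above can be sketched as follows. This is a hedged illustration rather than the paper's implementation: it assumes voters and predicted targets are given as `(x, z)` positions in bird's-eye view, that "in front" and "behind" are decided along the z axis, that distance is Euclidean, and that `max_dist` is the filtering threshold. All of these names and choices are assumptions made for the example.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Position = Tuple[float, float]  # assumed layout: (x, z) in bird's-eye view


def neighbor_votes(
    voters: Sequence[Position],
    targets: Sequence[Position],
    max_dist: float,
) -> List[int]:
    """Return the number of votes each predicted target receives."""
    votes: Counter = Counter()
    for vx, vz in voters:
        front: Optional[Tuple[float, int]] = None  # (distance, target index)
        back: Optional[Tuple[float, int]] = None
        for i, (tx, tz) in enumerate(targets):
            d = math.hypot(tx - vx, tz - vz)
            if tz >= vz:  # assumption: "in front" means larger z
                if front is None or d < front[0]:
                    front = (d, i)
            elif back is None or d < back[0]:
                back = (d, i)
        # Every voter votes first; votes that travel too far are dropped.
        for choice in (front, back):
            if choice is not None and choice[0] <= max_dist:
                votes[choice[1]] += 1
    return [votes[i] for i in range(len(targets))]


if __name__ == "__main__":
    targets = [(0.0, 10.0), (0.0, 20.0), (5.0, 50.0)]
    voters = [(0.0, 12.0), (1.0, 15.0), (0.0, 9.0), (0.0, 30.0)]
    print(neighbor_votes(voters, targets, max_dist=8.0))
```

In the toy example, the isolated third target collects no votes, which is the behavior the article attributes to false detections: few nearby feature points support them.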
3. Experiments
1. Comparison on the validation set. First, the paper compares the BEV and 3D detection accuracy of Neighbor-Vote with several recent monocular 3D object detection models:
Table 1: Performance comparison on the KITTI validation set. "Extra information" refers to supervision other than the 3D boxes; "mask" refers to segmentation labels.
2. Ablation experiments. The paper runs ablation experiments on the model to analyze and verify the role of each module, as shown in Table 2.
Table 2: Ablation analysis on the KITTI validation set. The paper quantifies the influence of the self-attention module (SA), ROI score association (RA), the neighbor voting branch (V), and the fusion of the two classification branches (F).
3. Effectiveness of Neighbor-Vote in reducing false-positive boxes. The basic principle behind Neighbor-Vote is the assumption that most feature points will vote for real targets, so the neighbor voting mechanism can effectively filter out false-positive box predictions. To confirm this principle, the paper compares the numbers of true positives and false positives of the baseline network (only the pseudo point cloud generation module and the 3D detector) and of the proposed network at different IoU thresholds, as shown in Table 3. Specifically, when the IoU between a predicted box and the ground truth is greater than a preset threshold, for example 0.3, 0.5 or 0.7, the predicted box is counted as a true positive (TP); otherwise it is a false positive (FP). Next, the paper counts the FPs that appear in the baseline network but not in the proposed network. Here, the IoU threshold used to decide whether two boxes coincide is set to 0.1: when the IoU of two boxes is greater than 0.1, they are considered to point to the same target. In this way, the paper reports a lower bound on the number of false-positive boxes that the proposed network effectively removes. Figure 3 shows that the proposed network eliminates 73.8% (IoU = 0.5) and 55.4% (IoU = 0.7) of the false-positive boxes on the KITTI validation set.
Finally, the paper also checks whether the model removes a large number of true positive boxes (TP). As shown in Figure 3(b), only a small fraction of TPs is lost, e.g. 6.4% and 4.8% at IoU = 0.5 and IoU = 0.7, respectively.
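The TP/FP bookkeeping described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's evaluation code: boxes are axis-aligned rectangles `(x_min, z_min, x_max, z_max)` in bird's-eye view (real KITTI boxes are rotated), a prediction counts as a TP if it overlaps any ground-truth box above the threshold (no one-to-one matching), and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

BevBox = Tuple[float, float, float, float]  # assumed: (x_min, z_min, x_max, z_max)


def iou(a: BevBox, b: BevBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def split_tp_fp(
    preds: Sequence[BevBox], gts: Sequence[BevBox], thr: float
) -> Tuple[List[BevBox], List[BevBox]]:
    """A prediction is a TP if its IoU with some ground-truth box exceeds thr."""
    tps: List[BevBox] = []
    fps: List[BevBox] = []
    for p in preds:
        (tps if any(iou(p, g) > thr for g in gts) else fps).append(p)
    return tps, fps


def removed_fps(
    baseline_fps: Sequence[BevBox], new_preds: Sequence[BevBox], same_thr: float = 0.1
) -> List[BevBox]:
    """Baseline FPs with no counterpart (IoU > same_thr) among the new predictions."""
    return [
        b for b in baseline_fps if not any(iou(b, n) > same_thr for n in new_preds)
    ]
```

Because a baseline FP is counted as removed only when no new prediction overlaps it by more than 0.1, the count is conservative, which matches the article's description of the reported figure as a lower bound.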
Table 3: Number of FPs and relative change in the number of TPs on the KITTI validation set.
Figure 3: (a) The number of FPs in the baseline network and the number of FPs successfully removed by the proposed network; (b) the number of TPs in the baseline network and the number of TPs removed by mistake.
4. Summary
This work proposes a monocular 3D object detection framework based on neighbor voting. The key difference from previous work is that it takes into account the predictions of the neighboring feature points around a target to improve detection from severely deformed point clouds. Through voting, the individual, noisy predictions of the feature points together form an effective prediction. In addition, the neighbor prediction and the local prediction are combined with adaptive weights to obtain the final prediction. Experiments on the KITTI dataset demonstrate the effectiveness of the method.