Neighbor-Vote: Optimizing Monocular 3D Object Detection with Neighbor Voting (ACM MM 2021)

2022-06-24 06:03:00 3D vision workshop

Paper: Neighbor-Vote: Improving Monocular 3D Object Detection through Neighbor Distance Voting

Link: https://arxiv.org/pdf/2107.02493.pdf

Abstract: As cameras are used more and more widely in new applications such as autonomous driving, 3D object detection from monocular images has become an important task in visual scene understanding. Recent progress in monocular 3D object detection relies on pseudo point cloud generation, that is, using monocular depth estimation to lift 2D pixels to pseudo 3D points. However, depth estimated from a monocular image is not accurate, which inevitably shifts the positions of the pseudo points within an object. As a result, the predicted boxes may be mislocated and deformed. This paper presents a novel neighbor-voting method in which predictions from neighbors help improve object detection on severely deformed pseudo point clouds. Specifically, each feature point around an object forms its own prediction, and a "consensus" is then built through voting. In this way, neighbor predictions can be effectively combined with the local prediction to achieve more accurate 3D detection. To further enlarge the difference between foreground region-of-interest (ROI) pseudo points and background points, the paper also encodes the ROI prediction scores of 2D foreground pixels into the corresponding pseudo 3D points. The method is evaluated on the KITTI benchmark, where its bird's-eye-view (BEV) detection results on the validation set outperform the current state of the art, especially at the "hard" difficulty level.

1. Introduction

3D object detection is one of the most important tasks in applications that depend on understanding the 3D world, such as autonomous driving. Many point-cloud-based 3D object detection algorithms already exist. Although these methods achieve excellent performance, LiDAR is still too expensive to be installed on every car. Cheaper alternatives are therefore more popular, especially cameras, which are low in price and high in frame rate.

On the other hand, because depth information is missing, 3D detection on RGB images, and on monocular images in particular, remains a daunting challenge. Existing methods address it as follows: depth is first estimated from the monocular image, the 2D pixels are then converted to pseudo 3D points, and a 3D object detector is finally applied to the resulting pseudo point cloud.

Compared with a real LiDAR point cloud, the pseudo point cloud described above has several problems. First, because monocular depth estimation is inevitably inaccurate, the pseudo point cloud suffers from position shift and shape deformation, which can harm 3D box regression. Second, depth estimation is less accurate for distant objects than for nearby ones, so the distortion grows noticeably with distance. These distorted pseudo point clouds lead to a large number of false-positive boxes.

This paper presents a method called Neighbor-Vote. Specifically, every point around an object on the feature map is treated as a "voter". Each voter votes for a fixed number of nearby objects from its own perspective. Through this voting process, false-positive objects receive far fewer votes than real objects and are therefore easier to identify.

In summary, this paper makes the following three contributions:

  • An efficient network for 3D detection from monocular images. The network consists of four main steps: pseudo point cloud generation, 2D ROI score association, attention-based feature extraction, and neighbor-assisted prediction.
  • A neighbor-voting method that effectively eliminates false-positive boxes in predictions made from pseudo point clouds. Neighbor predictions and the local prediction are combined adaptively, which greatly improves the accuracy of box prediction.
  • Experiments showing that the method achieves the best performance on the KITTI BEV benchmark.

2. Neighbor-Vote system design

2.1 Overview

Figure 1: Overall framework of Neighbor-Vote

This paper presents Neighbor-Vote, a framework based on pseudo point clouds that aims to improve monocular 3D object detection through additional predictions from neighboring features. As shown in Figure 1, Neighbor-Vote is a single-stage detector consisting of the following four main steps:

(1) Pseudo point cloud generation.

(2) 2D ROI score association.

(3) Feature extraction based on self-attention.

(4) Object prediction assisted by neighbor voting.

Figure 1 shows the whole framework; the four steps are discussed one by one below.

2.2 Pseudo point cloud generation
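As the introduction describes, this step lifts 2D pixels to pseudo 3D points using an estimated depth map. A minimal sketch of that back-projection is given below, assuming a standard pinhole camera model. The function name `depth_to_pseudo_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative assumptions and are not taken from the paper, and the depth map here is a constant placeholder rather than the output of a depth estimation network.

```python
import numpy as np

def depth_to_pseudo_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (H*W, 3) array of camera-frame points.

    Assumes a pinhole camera: fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point. These names are hypothetical.
    """
    h, w = depth.shape
    # Pixel coordinate grids: u runs along columns, v runs along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    # Points are returned in row-major pixel order.
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Placeholder depth map: every pixel is 10 m away.
depth = np.full((2, 3), 10.0)
points = depth_to_pseudo_points(depth, fx=100.0, fy=100.0, cx=1.0, cy=0.5)
```

Because every 3D coordinate is scaled by the estimated depth, any depth error moves the point along its viewing ray, which is the position shift the article refers to.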

2.3 Foreground likelihood association for the pseudo point cloud

Depth estimation is much less accurate for distant objects than for nearby ones, so pseudo-LiDAR points at long range are shifted by a large amount. To compensate for the inaccurate depth estimates, this step tries to enlarge the difference between the foreground regions of interest (ROIs) and the background, especially for distant objects. To this end, the paper associates the ROI score of each foreground 2D pixel with the corresponding pseudo-LiDAR point and uses the score to indicate how likely the point is to be a foreground point.

The paper observes that a distant object in a 2D image is small and low in resolution but usually still retains some semantic information. In fact, for the Car category of the KITTI dataset, many 2D detectors, such as FCOS, CenterNet, and Cascade R-CNN, reach an average precision (AP) above 75% on hard-level objects at an IoU threshold of 0.7. Based on this result, the paper uses a 2D detector to extract ROI regions and associates the predicted scores with the corresponding pseudo-LiDAR points.

This paper uses FCOS as the 2D detector. The score of each pixel inside a bounding box is projected into 3D space, and the score is then encoded as the fourth channel of the pseudo point cloud.
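The score association can be sketched as follows. This is an illustrative sketch rather than the paper's implementation: the helper names `roi_score_map` and `attach_scores` are hypothetical, and it assumes that background pixels get a score of 0, that overlapping boxes contribute their maximum score, that boxes are integer pixel coordinates `(x1, y1, x2, y2)` with exclusive upper bounds, and that the pseudo points are stored in row-major pixel order.

```python
import numpy as np

def roi_score_map(shape, boxes, scores):
    """Build an (H, W) map holding, per pixel, the highest score of any box covering it.

    Pixels outside every box keep a score of 0 (an assumption of this sketch).
    """
    score_map = np.zeros(shape, dtype=np.float32)
    for (x1, y1, x2, y2), s in zip(boxes, scores):
        region = score_map[y1:y2, x1:x2]   # view into score_map
        np.maximum(region, s, out=region)  # in-place update through the view
    return score_map

def attach_scores(points, score_map):
    """Append the per-pixel score as a fourth channel of the (H*W, 3) pseudo points."""
    return np.concatenate([points, score_map.reshape(-1, 1)], axis=1)

# Toy example: a 4x5 image with one detected box of score 0.9.
points = np.zeros((4 * 5, 3), dtype=np.float32)
score_map = roi_score_map((4, 5), boxes=[(1, 1, 3, 3)], scores=[0.9])
points_with_score = attach_scores(points, score_map)
```

Each pseudo point then carries three spatial coordinates plus a foreground likelihood, so the 3D detector can tell ROI points from background points even when their positions are badly shifted.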

2.4 Self-attention feature extraction

Because the pseudo point cloud is severely shifted and deformed, identifying the position and shape of an object relies on spatial context from the feature points around it, and gathering this context requires extracting features over relatively long distances. Stacking multiple convolutions with a fixed receptive field at each location cannot effectively extract features over a long enough range. The paper therefore incorporates a self-attention mechanism into the feature extraction module.
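To illustrate why self-attention helps with long-range context, the sketch below implements generic single-head scaled dot-product self-attention over a flattened feature map. The article does not specify the exact attention module used in the paper, so this should be read as a generic stand-in; the random projection matrices are placeholders for weights that would be learned in a real network.

```python
import numpy as np

def self_attention(features, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over N feature locations.

    features: (N, C) array, one row per location of the flattened feature map.
    w_q, w_k, w_v: (C, D) projection matrices (placeholders for learned weights).
    Returns the (N, D) attended features and the (N, N) attention weights.
    """
    q, k, v = features @ w_q, features @ w_k, features @ w_v
    logits = q @ k.T / np.sqrt(q.shape[1])         # pairwise affinities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over all locations
    return weights @ v, weights

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))  # 6 locations, 4 channels
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))
out, weights = self_attention(features, w_q, w_k, w_v)
```

Every row of `weights` spans all locations, so each output feature can draw on any other position in a single layer, whereas a convolution only sees its fixed receptive field.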

2.5 Box prediction combined with neighbor voting

Neighbor voting. As mentioned earlier, a pseudo point cloud does not describe the position and shape of an object as accurately as a real point cloud. To meet this challenge, the paper makes use of the feature points near an object (called "neighbors" here) and lets them help determine the object's location. Specifically, it takes the individual viewpoint of each neighbor point and tries to form a "consensus" through a voting mechanism. Consider a bird's-eye-view feature map whose size along the x and z directions is set by the downsampling rate. Feature points close to a predicted object are regarded as voting neighbors, or "voters". Each voter casts two votes; that is, it votes for its two closest objects, one in front and one behind (in terms of relative position along one axis).

Here P is the list of predicted objects, and the two votes are the selected front and back objects. All feature points first take part in the voting, and the votes whose distance exceeds a certain limit are then filtered out, so that every voting neighbor is genuinely close to a predicted object. The voting process is illustrated in Figure 2.

Figure 2: Illustration of the voting process
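The voting rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the function name `neighbor_vote` and the threshold `max_dist` are hypothetical, "front" and "back" are assumed to mean larger and smaller z in bird's-eye-view coordinates, and each voter's own predicted location is replaced by its position on the feature map.

```python
import numpy as np

def neighbor_vote(voters, objects, max_dist):
    """Count the votes received by each predicted object.

    voters:  (N, 2) BEV coordinates (x, z) of feature points.
    objects: (M, 2) BEV centers (x, z) of predicted objects.
    Each voter votes for its closest object in front (larger z, an assumption)
    and its closest object behind (smaller or equal z). Votes cast over a
    distance greater than max_dist are discarded.
    """
    votes = np.zeros(len(objects), dtype=int)
    for p in voters:
        dist = np.linalg.norm(objects - p, axis=1)
        in_front = objects[:, 1] > p[1]
        for mask in (in_front, ~in_front):
            if mask.any():
                # Index of the nearest object within this half-space.
                idx = np.flatnonzero(mask)[np.argmin(dist[mask])]
                if dist[idx] <= max_dist:
                    votes[idx] += 1
    return votes

# Two objects surrounded by voters and one isolated prediction.
objects = np.array([[0.0, 10.0], [0.0, 20.0], [30.0, 50.0]])
voters = np.array([[0.0, 9.0], [0.5, 11.0], [0.0, 15.0], [0.0, 19.0]])
votes = neighbor_vote(voters, objects, max_dist=6.0)
print(votes)  # [3 2 0]
```

In this toy example the isolated third prediction collects no votes, which mirrors the idea that false-positive boxes win far fewer votes than real objects.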

3. Experiments

1. Comparison on the validation set. First, the paper compares the BEV and 3D detection accuracy of Neighbor-Vote with several recent monocular 3D object detection models:

Table 1: Performance comparison on the KITTI validation set. "Additional information" refers to supervision other than 3D boxes; "mask" refers to labels from the segmentation task.

2. Ablation study. Ablation experiments were carried out on the model to analyze and verify the contribution of each module, as shown in Table 2.

Table 2: Ablation analysis on the KITTI validation set. The paper quantifies the influence of the self-attention module (SA), ROI score association (RA), the neighbor-voting branch (V), and the fusion of the two classification branches (F).

3. Effectiveness of Neighbor-Vote in reducing false-positive boxes. The basic principle behind Neighbor-Vote is that most feature points will vote for real objects, so the neighbor-voting mechanism can effectively filter out false-positive predictions. To confirm this principle, the paper compares the number of true positives and false positives produced by the baseline network (only the pseudo point cloud generation module and a 3D detector) and by the proposed network at different IoU thresholds, as shown in Table 3. Specifically, when the IoU between a predicted box and a ground-truth box exceeds a preset threshold, for example 0.3, 0.5, or 0.7, the prediction is counted as a true positive (TP); otherwise it is a false positive (FP). Next, the paper counts the FPs that appear in the baseline network but not in the proposed network. Here the IoU threshold for deciding whether two boxes coincide is set to 0.1: when the IoU of two boxes is greater than 0.1, they are considered to point to the same object. In this way, the paper reports a lower bound on the number of FPs effectively removed by the proposed network. Figure 3 shows that the proposed network eliminates 73.8% (IoU = 0.5) and 55.4% (IoU = 0.7) of the false-positive boxes on the KITTI validation set.

Finally, the paper verifies whether the model also removes a large number of true-positive boxes. As shown in Figure 3(b), only a small fraction of TPs are lost, e.g. 6.4% at IoU = 0.5 and 4.8% at IoU = 0.7.

Table 3: Relative change in the number of FPs and TPs on the KITTI validation set.

Figure 3: (a) FPs in the baseline network and the FPs successfully removed by the proposed network; (b) the number of TPs in the baseline network and the number of TPs deleted by mistake.
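The false-positive accounting described above can be sketched as follows. This is a simplified illustration: the boxes are axis-aligned rectangles `(x1, z1, x2, z2)` in bird's-eye view, whereas a real KITTI evaluation uses rotated boxes, and the function names are hypothetical. The toy boxes are made up for the example and are not results from the paper.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, z1, x2, z2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def false_positives(preds, gts, thresh):
    """Predictions whose best IoU with any ground-truth box does not exceed thresh."""
    return [p for p in preds
            if max((iou(p, g) for g in gts), default=0.0) <= thresh]

def removed_fps(baseline_preds, ours_preds, gts, thresh, match_iou=0.1):
    """Baseline FPs that no box in ours_preds overlaps with IoU above match_iou."""
    return [fp for fp in false_positives(baseline_preds, gts, thresh)
            if all(iou(fp, q) <= match_iou for q in ours_preds)]

# Made-up boxes: one ground truth, three baseline predictions, two of ours.
gts = [(0, 0, 4, 4)]
baseline = [(0, 0, 4, 4), (10, 10, 14, 14), (20, 20, 24, 24)]
ours = [(0, 0, 4, 4), (10.5, 10, 14.5, 14)]
fps = false_positives(baseline, gts, thresh=0.5)
removed = removed_fps(baseline, ours, gts, thresh=0.5)
```

Because a baseline FP counts as removed only when nothing in the other network's output overlaps it by more than 0.1, a surviving box that has merely shifted still counts as "not removed", which is why the reported figure is a lower bound.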

4. Summary

This work proposes a neighbor-voting framework for monocular 3D object detection. The key difference from previous work is that predictions from the neighboring feature points around an object are taken into account to help improve detection on severely deformed point clouds. Through voting, the individual, noisy predictions of the feature points together form an effective prediction. In addition, the neighbor prediction and the local prediction are combined with adaptive weights to obtain the final result. Experiments on the KITTI dataset demonstrate the effectiveness of the method.

Original site

Copyright notice
This article was written by [3D vision workshop]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2021/07/20210727114001377w.html
