3D Semantic Segmentation - Scribble-Supervised LiDAR Semantic Segmentation
2022-07-25 16:29:00 【Lemon_ Yam】
Main contributions of the paper (CVPR 2022 Oral):

- Proposes ScribbleKITTI, the first scribble-annotated LiDAR semantic segmentation dataset.
- Proposes class-range-balanced self-training to counter the bias of pseudo-labels toward the dominant classes and toward nearby dense regions.
- Introduces a pyramid local semantic-context descriptor to enrich the input point cloud and thereby improve pseudo-label quality.
- By combining points 2 and 3 with the mean teacher framework, the proposed pipeline reaches 95.7% of fully-supervised performance using only 8% of the labeled points.
Densely annotating LiDAR point clouds remains prohibitively expensive and cannot keep pace with the ever-growing volume of data. Current research on 3D semantic segmentation focuses mainly on fully-supervised methods, while effective 3D semantic segmentation under weak supervision remains largely unexplored. The paper therefore proposes annotating LiDAR point clouds with scribbles and releases ScribbleKITTI, the first scribble-annotated dataset for 3D semantic segmentation. However, scribble annotation also means that the unlabeled points, which carry the boundary information, go unused, and the scarcity of labeled points (only 8% of the points are annotated) weakens the supervision of long-tail classes and lowers their confidence, ultimately degrading model performance.
To close the performance gap that arises when using such weak annotations, the paper proposes a pipeline consisting of three standalone components that can be combined with any LiDAR semantic segmentation model. The released code uses the Cylinder3D model; if you are interested in Cylinder3D, see my earlier blog post. Using only 8% of the labels, the pipeline reaches 95.7% of fully-supervised performance.
ScribbleKITTI Dataset

Scribble annotation is a popular and effective approach in 2D semantic segmentation, but unlike 2D images, 3D point clouds preserve metric space and therefore retain strong geometric structure. To exploit this, the paper proposes annotating the LiDAR point cloud with the more geometric line scribbles. Compared with free-form scribbles, line scribbles annotate geometric classes that span large distances (such as road and sidewalk) much faster, since a line scribble only needs the start and end positions of the stroke. As shown for the car in the figure above (blue line), two points are enough to complete the annotation. This reduces the annotation time from 1.5-4.5 hours to 10-25 minutes.
ScribbleKITTI is annotated on the train-split of SemanticKITTI, which contains 10 sequences, 19,130 scans, and 2,349 million points; ScribbleKITTI contains only 189 million labeled points.
As shown in Figure 3 above, the line scribbles are drawn in 2D and projected onto 3D surfaces, so they become indistinguishable once the viewing angle changes.
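To give a feel for how few clicks a line scribble needs, the sketch below labels every point that lies close to the 2D segment between two user-chosen endpoints. It is a minimal illustration of the idea, not the authors' annotation tool; the function name and the distance threshold `max_dist` are assumptions.

```python
import numpy as np

def label_line_scribble(points_xy, start, end, class_id, labels, max_dist=0.2):
    """Assign class_id to points within max_dist of the 2D segment start-end.

    points_xy: (N, 2) points projected onto the ground plane.
    start, end: the two clicked endpoints of the scribble, each of shape (2,).
    labels: (N,) label array updated in place (0 = unlabeled).
    """
    seg = end - start
    seg_len_sq = np.dot(seg, seg) + 1e-12
    # Project every point onto the segment and clamp to its extent.
    t = np.clip((points_xy - start) @ seg / seg_len_sq, 0.0, 1.0)
    closest = start + t[:, None] * seg
    dist = np.linalg.norm(points_xy - closest, axis=1)
    labels[dist < max_dist] = class_id
    return labels
```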
Network structure

- The proposed pipeline is divided into three stages, training, pseudo-labeling, and distillation. The stages build on one another to improve the quality of the generated pseudo-labels and thereby the accuracy of the model (see the sketch after this list).
- In the training stage, the input is first enriched with PLS as data augmentation and a mean teacher is then trained, which helps generate higher-quality pseudo-labels later.
- In the pseudo-labeling stage, CRB is used to generate target labels, reducing the loss of pseudo-label quality caused by the point cloud's own properties.
- In the distillation stage, the mean teacher is trained again on the pseudo-labels generated above.
- In the mean teacher framework, $L_S$ and $L_U$ denote the losses on labeled and unlabeled points, respectively.
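A minimal sketch of how the three stages could be chained. The helpers `add_pls`, `train_mt`, and `crb` are stand-ins (passed in as callables) for the PLS augmentation, mean-teacher training, and CRB pseudo-labeling steps described later; their names and signatures are assumptions for illustration, not the authors' API.

```python
from typing import Callable, Sequence

def run_pipeline(scans: Sequence, scribbles: Sequence,
                 add_pls: Callable, train_mt: Callable, crb: Callable):
    """Three-stage pipeline: training -> pseudo-labeling -> distillation."""
    # 1) Training: enrich the scribble-labeled input with PLS, train a mean teacher.
    augmented = [add_pls(s, l) for s, l in zip(scans, scribbles)]
    student, teacher = train_mt(augmented, scribbles)

    # 2) Pseudo-labeling: class-range-balanced selection from the teacher output
    #    (assumed to keep the original scribble labels where they exist).
    pseudo = [crb(teacher, s) for s in augmented]

    # 3) Distillation: retrain the mean teacher on scribbles + pseudo-labels.
    student, teacher = train_mt(augmented, pseudo)
    return teacher
```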
Partial Consistency Loss with Mean Teacher
The mean teacher framework consists of two parts: a student network with weights $\theta$ and a teacher network with weights $\theta^{EMA}$. As usual, the student weights are updated by gradient descent, while the teacher weights are obtained as the exponential moving average of the student weights, computed as follows:
$$\theta_t^{EMA} = \alpha\, \theta_{t-1}^{EMA} + (1-\alpha)\, \theta_t$$
where $\theta_t$ is the student weight at step $t$, $\theta_t^{EMA}$ is the teacher weight at step $t$, and $\alpha$ is the smoothing factor. The exponential moving average avoids the limitations of the Temporal Ensembling method and yields a more accurate model than directly using the trained student weights.
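A minimal PyTorch-style sketch of this EMA update, assuming `student` and `teacher` are two copies of the same network; the function name and the default `alpha = 0.99` are assumptions for illustration.

```python
import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, alpha: float = 0.99):
    """theta_t^EMA = alpha * theta_{t-1}^EMA + (1 - alpha) * theta_t."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
```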
The partial consistency loss applies the consistency loss only to unlabeled points. This reduces the uncertainty injected by the teacher network on labeled points, allowing stricter supervision there, while the more accurate teacher output is used to supervise the unlabeled points. The loss function is as follows:
$$\min_{\theta} \sum_{f=1}^{F} \sum_{i=1}^{|P_f|} G_{i,f}, \qquad G_{i,f} = \begin{cases} H(\hat{y}_{f,i}|_{\theta},\, y_{f,i}), & p_{f,i} \in S \\ \log(\hat{y}_{f,i}|_{\theta})\, \hat{y}_{f,i}|_{\theta^{EMA}}, & p_{f,i} \in U \end{cases}$$
where $S$ is the set of labeled points, $U$ the set of unlabeled points, $H$ the supervised loss (usually cross-entropy), $F$ the number of point cloud frames, $|P_f|$ the number of points in frame $f$, $\hat{y}_{f,i}|_{\theta}$ the student prediction, $y_{f,i}$ the ground truth, $\hat{y}_{f,i}|_{\theta^{EMA}}$ the teacher prediction, and $p_{f,i}$ the $i$-th point of frame $f$.
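A minimal PyTorch sketch of the partial consistency loss, assuming per-point logits from student and teacher and a label convention where `ignore_index = 0` marks unlabeled points; the masking convention and function name are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def partial_consistency_loss(student_logits, teacher_logits, labels, ignore_index=0):
    """Cross-entropy on labeled points, soft consistency on unlabeled points."""
    labeled = labels != ignore_index
    # Supervised term H(y_hat|theta, y) on scribble-labeled points.
    loss_s = F.cross_entropy(student_logits[labeled], labels[labeled])
    # Consistency term: cross-entropy between teacher soft labels and student log-probs.
    log_p_student = F.log_softmax(student_logits[~labeled], dim=-1)
    p_teacher = F.softmax(teacher_logits[~labeled], dim=-1).detach()
    loss_u = -(p_teacher * log_p_student).sum(dim=-1).mean()
    return loss_s + loss_u
```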
Although the mean teacher supervises the unlabeled points, the information it provides is limited by the teacher's own performance. Even when the teacher predicts the correct label for a point, the soft pseudo-label still assigns confidence to the other classes, and this residual confidence affects the output of the student network.
Class-range-balanced Self-training (CRB-ST)
To address the injected uncertainty described above and to make more direct use of the prediction confidence on unlabeled points, the paper expands the annotated dataset through self-training. Self-training is combined with the mean teacher so that the mean teacher's soft pseudo-labels still guide uncertain predictions, while confident predictions are strengthened into hard pseudo-labels. Using the class with the maximum confidence predicted by the teacher network, a set of target labels $L$ can be generated for the unlabeled points.
Due to the nature of the LiDAR sensor, the local point density varies with the beam radius and sparsity grows with distance. As a result, pseudo-labels are mostly sampled from dense, nearby regions, where the estimated confidence tends to be high. To reduce this bias during pseudo-label generation, the paper proposes a modified self-training scheme combined with class-range balancing (CRB). The horizontal plane is first coarsely partitioned into $R$ annuli of width $B$ centered on the ego-vehicle, so that each annulus contains the points within a certain range of distances. Within each annulus, the globally most confident predictions of every class are pseudo-labeled. This ensures that reliable labels are obtained while distributing them proportionally across ranges and across all classes. The loss function is as follows:
$$\begin{aligned} &\min_{\theta, \hat{y}} \sum_{f=1}^{F} \sum_{i=1}^{|P_f|} \Big[ G_{i,f} - \sum_{c=1}^{C} \sum_{r=1}^{R} F_{i,f,c,r} \Big] \\ &F_{i,f,c,r} = \begin{cases} \big(\log(\hat{y}_{f,i}^{(c)}|_{\theta^{EMA}}) + k^{(c,r)}\big)\, \hat{y}_{f,i}^{(c)}, & r = \lfloor \|(p_{x,y})_{f,i}\| / B \rfloor \\ 0, & \text{otherwise} \end{cases} \end{aligned}$$
where $k^{(c,r)}$ is the class-annulus pairwise negative log-threshold and $R$ is the number of annuli.
To solve this nonlinear integer optimization problem, the paper adopts the following solver:
$$\hat{y}_{f,i}^{(c)*} = \begin{cases} 1, & \text{if } c = \arg\max \hat{y}_{f,i}|_{\theta^{EMA}} \;\text{and}\; \hat{y}_{f,i}^{(c)}|_{\theta^{EMA}} > \exp(-k^{(c,r)}) \;\text{with}\; r = \lfloor \|(p_{x,y})_{f,i}\| / B \rfloor \\ 0, & \text{otherwise} \end{cases}$$
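A rough NumPy sketch of CRB pseudo-label generation, assuming teacher class probabilities `probs` of shape (N, C), 2D coordinates `points_xy`, a ring width `B`, and a labeling fraction `frac` per class and ring that determines the thresholds $\exp(-k^{(c,r)})$ as confidence quantiles. The interface and the quantile-based threshold selection are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def crb_pseudo_labels(probs, points_xy, num_rings, B, frac=0.5, ignore=-1):
    """Class-range-balanced pseudo-labels from teacher probabilities.

    probs: (N, C) teacher softmax outputs; points_xy: (N, 2) coordinates.
    Returns (N,) pseudo-labels, with `ignore` for points below the threshold.
    """
    conf = probs.max(axis=1)            # max confidence per point
    cls = probs.argmax(axis=1)          # teacher-predicted class per point
    rings = np.minimum((np.linalg.norm(points_xy, axis=1) // B).astype(int),
                       num_rings - 1)   # r = floor(||p_xy|| / B), clamped
    pseudo = np.full(len(cls), ignore, dtype=np.int64)
    num_classes = probs.shape[1]
    for r in range(num_rings):
        for c in range(num_classes):
            mask = (rings == r) & (cls == c)
            if not mask.any():
                continue
            # Keep only the top `frac` most confident predictions of class c in ring r,
            # i.e. conf > exp(-k^{(c,r)}) with the threshold set as a confidence quantile.
            thresh = np.quantile(conf[mask], 1.0 - frac)
            keep = mask & (conf > thresh)
            pseudo[keep] = c
    return pseudo
```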
Pyramid Local Semantic-context (PLS)
To further guarantee higher-quality pseudo-labels, the paper introduces a new descriptor that uses the available scribbles to enrich the features of the input points.
The paper observes that class labels in 3D space obey both a spatial-smoothness constraint and a semantic-pattern constraint. Spatial smoothness means that a point is likely to share its class label with at least one of its neighbors; the semantic-pattern constraint refers to the complex set of high-level rules governing the spatial relationships between classes. The paper therefore treats the local semantic prior as a rich point descriptor that encapsulates both constraints, and proposes using local semantic context at several scaling resolutions to reduce the ambiguity when propagating information between labeled and unlabeled points, thereby improving pseudo-label quality.
The space is first discretized into coarse voxels. This avoids over-descriptive features that would make the network overfit to the scribble labels, giving it better generalization and a better grasp of meaningful geometric relationships. To accommodate the inherent point distribution of LiDAR sensors at different resolutions, bins of various sizes are used in a cylindrical coordinate system. For each bin $b_i$ a coarse histogram is computed, and the normalized histograms are concatenated (as shown in Figure 6 above). The computation is as follows:
$$\begin{aligned} \mathbf{h}_i &= [h_i^{(1)}, \cdots, h_i^{(C)}] \in \mathbb{R}^C \\ h_i^{(c)} &= \#\{y_j = c\ \forall j \mid p_j \in b_i\} \\ PLS &= [\mathbf{h}_i^{1}/\max(\mathbf{h}_i^{1}), \cdots, \mathbf{h}_i^{s}/\max(\mathbf{h}_i^{s})] \in \mathbb{R}^{sC} \end{aligned}$$
where $\mathbf{h}_i$ is the class histogram of bin $b_i$ at one resolution, $h_i^{(c)}$ is the count of class $c$ in that bin, $PLS$ is the concatenation of the normalized histograms over all resolutions, and $s$ is the number of resolutions (pyramid scales).
The $PLS$ descriptor is appended to the input features, and the input LiDAR point cloud is redefined as the augmented set $P_{aug} = \{p \mid p = (x, y, z, I, PLS) \in \mathbb{R}^{4+sC}\}$, where $x, y, z$ are the 3D coordinates of the point and $I$ is the reflection intensity. Training on $P_{aug}$ instead of the original data yields higher-quality pseudo-labels in the pseudo-labeling stage.
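A simplified NumPy sketch of computing the PLS descriptor at several cylindrical resolutions, assuming scribble labels where 0 means unlabeled. The bin counts per resolution, the grid construction, and the handling of empty bins are assumptions made for illustration and will differ from the authors' exact implementation.

```python
import numpy as np

def pls_descriptor(points_xyz, scribble_labels, num_classes,
                   resolutions=((10, 12, 4), (20, 24, 8), (40, 48, 16))):
    """Concatenate normalized per-bin class histograms over several resolutions.

    points_xyz: (N, 3); scribble_labels: (N,) with 0 = unlabeled.
    Returns an (N, s*C) feature where s = len(resolutions), C = num_classes.
    """
    rho = np.linalg.norm(points_xyz[:, :2], axis=1)
    phi = np.arctan2(points_xyz[:, 1], points_xyz[:, 0])
    z = points_xyz[:, 2]
    feats = []
    for (n_rho, n_phi, n_z) in resolutions:
        # Cylindrical bin index per point at this resolution.
        i_rho = np.clip((rho / (rho.max() + 1e-6) * n_rho).astype(int), 0, n_rho - 1)
        i_phi = np.clip(((phi + np.pi) / (2 * np.pi) * n_phi).astype(int), 0, n_phi - 1)
        i_z = np.clip(((z - z.min()) / (z.max() - z.min() + 1e-6) * n_z).astype(int), 0, n_z - 1)
        bin_idx = (i_rho * n_phi + i_phi) * n_z + i_z
        n_bins = n_rho * n_phi * n_z
        # Histogram of scribble-labeled classes per bin (class 0 = unlabeled is skipped).
        hist = np.zeros((n_bins, num_classes))
        labeled = scribble_labels > 0
        np.add.at(hist, (bin_idx[labeled], scribble_labels[labeled] - 1), 1)
        # Normalize each bin's histogram by its maximum (empty bins stay zero).
        hist /= np.maximum(hist.max(axis=1, keepdims=True), 1)
        feats.append(hist[bin_idx])
    return np.concatenate(feats, axis=1)
```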
Paper: https://arxiv.org/pdf/2203.08537.pdf
Code: https://github.com/ouenal/scribblekitti
Further reading: Deep Learning (74): semi-supervised Mean Teacher and semi-supervised self-training.