3D semantic segmentation - PVD
2022-07-25 16:29:00 【Lemon_ Yam】
PVD (CVPR 2022) makes the following main contributions:

- Studies how knowledge distillation can be applied to 3D point cloud semantic segmentation for model compression
- Proposes point-to-voxel knowledge distillation to cope with the inherent sparsity, randomness, and varying density of point cloud data
- Proposes a supervoxel partition method that makes the affinity distillation process tractable
- Proposes a difficulty-aware sampling strategy that makes supervoxels containing minority classes and distant objects more likely to be sampled, thereby improving the distillation efficacy on these difficult cases
PVD was evaluated extensively on the two popular LiDAR segmentation benchmarks, nuScenes and SemanticKITTI, with three representative backbones: Cylinder3D, SPVNAS, and MinkowskiNet. It consistently outperforms previous distillation methods. Notably, on the challenging nuScenes and SemanticKITTI datasets, it achieves roughly a 75% reduction in MACs and a 2x speedup on the competitive Cylinder3D model, ranks 1st on the Waymo and SemanticKITTI (single-scan) challenges, and 3rd on the SemanticKITTI (multi-scan) challenge.
Network structure
The figure below takes Cylinder3D as an example of the PVD network structure. It contains two networks: a teacher and a student. The student network has half as many channels per layer as the teacher network. The teacher network consists of five parts: a point feature extraction module, a point-to-voxel transformation module, an encoder-decoder module, a DDCM module, and a point refinement module.

- The input point cloud is divided into a fixed number of supervoxels, and K supervoxels are sampled according to the difficulty-aware sampling strategy (K=1 in the figure, marked by a red box)
- The sampled supervoxels are fed into the point feature extraction module (MLPs) to obtain pointwise output
- The pointwise output is voxelized by the point-to-voxel transformation module
- The voxelized data is fed into the encoder-decoder module (an asymmetric 3D convolution network) to obtain voxelwise output
- The voxels pass through the DDCM module, which captures high-rank contextual features, giving the network enough capacity to model context information
- The contextual features are sent into the point refinement module (MLPs) to obtain the final pointwise output, from which the semantic predictions are made
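The point-to-voxel transformation in this pipeline maps each point to a (radius, angle, height) cell of a cylindrical grid, following Cylinder3D's cylindrical partition. Below is a minimal numpy sketch of that mapping; it assumes a uniform grid with out-of-range points clipped to the boundary cells, and the grid sizes and ranges are purely illustrative, not the paper's settings:

```python
import numpy as np

def cylindrical_voxel_indices(points, r_max, num_r, num_a, num_h, z_min, z_max):
    """Map Cartesian points (N, 3) to (radius, angle, height) voxel indices,
    i.e. a uniform grid in cylindrical coordinates as used by Cylinder3D."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)        # radial distance from the z axis
    phi = np.arctan2(y, x)                # azimuth angle in [-pi, pi]
    r_idx = np.clip((rho / r_max * num_r).astype(int), 0, num_r - 1)
    a_idx = np.clip(((phi + np.pi) / (2 * np.pi) * num_a).astype(int), 0, num_a - 1)
    h_idx = np.clip(((z - z_min) / (z_max - z_min) * num_h).astype(int), 0, num_h - 1)
    return np.stack([r_idx, a_idx, h_idx], axis=1)
```

Points sharing the same index triple fall into the same voxel, which is what the voxelization module aggregates over.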
In the knowledge distillation framework, the student network learns two levels of knowledge from the teacher network: the first level is the pointwise and voxelwise outputs; the second level is the inter-point and inter-voxel affinity matrices. For a brief introduction to Cylinder3D, please refer to my earlier blog post 【3D Semantic segmentation ——Cylinder3D】.
Point-to-Voxel Output Distillation
Compared with image data, point clouds are inherently sparse, which makes it difficult to train an effective student network from such sparse supervision signals. Moreover, although point cloud data contains fine-grained environmental perception information, a scan contains tens of thousands of points, so learning this knowledge pointwise is inefficient. To improve learning efficiency, the paper proposes to distill the voxelwise output in addition to the pointwise output, since there are far fewer voxels and they are easier to learn from. The pointwise and voxelwise distillation losses are:
$$\begin{aligned} L_{out}^p(O_S^p, O_T^p) &= \frac{1}{NC}\sum_{n=1}^N \sum_{c=1}^C KL(O_S^p(n, c)\,||\,O_T^p(n, c)) \\ L_{out}^v(O_S^v, O_T^v) &= \frac{1}{RAHC}\sum_{r=1}^R \sum_{a=1}^A \sum_{h=1}^H \sum_{c=1}^C KL(O_S^v(r, a, h, c)\,||\,O_T^v(r, a, h, c)) \end{aligned}$$
Here, $L_{out}^p$ is the pointwise distillation loss and $L_{out}^v$ the voxelwise distillation loss. $N$ is the number of points, $C$ the number of classes, $R$, $A$, and $H$ the radial, angular, and height dimensions of the voxel grid, and $KL(\cdot)$ the Kullback-Leibler divergence loss.
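As a concrete illustration of the pointwise loss above, here is a numpy sketch: it softmaxes the two networks' logits and computes $KL(\text{student}\,||\,\text{teacher})$ with the $1/(NC)$ normalisation (in practice this would be a framework loss such as a KL-divergence criterion on log-probabilities; the voxelwise version is analogous with the four-dimensional sum):

```python
import numpy as np

def pointwise_kl_loss(student_logits, teacher_logits):
    """KL(student || teacher) over softmaxed outputs, normalised by 1/(NC).
    Shapes: (N, C) logits for both networks."""
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    n, c = p_s.shape
    return np.sum(p_s * (np.log(p_s) - np.log(p_t))) / (n * c)
```

The loss is zero exactly when the student reproduces the teacher's distribution at every point.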
The same voxel may contain points from different classes, so how to assign a proper label to each voxel is also crucial to performance. The paper adopts the majority encoding strategy from Cylinder3D: each voxel takes the class label that occurs most often among its points.
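The majority encoding strategy can be sketched in a few lines of numpy: build a per-voxel histogram of point labels and take the argmax (empty voxels default to class 0 in this sketch, which is an implementation choice, not something the blog specifies):

```python
import numpy as np

def majority_voxel_labels(point_labels, voxel_ids, num_voxels, num_classes):
    """Majority encoding: each voxel takes the class label that occurs most
    often among the points falling inside it (empty voxels default to 0)."""
    counts = np.zeros((num_voxels, num_classes), dtype=np.int64)
    np.add.at(counts, (voxel_ids, point_labels), 1)  # per-voxel label histogram
    return counts.argmax(axis=1)
```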
Point-to-Voxel Affinity Distillation
Distilling only the pointwise and voxelwise outputs is not enough, because it considers each element in isolation and cannot capture the structural information of the surrounding environment. Since the input points are unordered, this structural knowledge is very important for LiDAR-based semantic segmentation models. A natural remedy is relational knowledge distillation, which computes the similarity between all pairs of point features, but that scheme is computationally expensive, hard to learn, and ignores the differences between points (of different classes and at different distances). The paper therefore reduces the computational cost and improves learning efficiency via supervoxel partition, and handles the differences between points via difficulty-aware sampling.
Supervoxel partition: to learn relational knowledge more efficiently, the paper divides the whole point cloud into supervoxels of size $R_s \times A_s \times H_s$. Each supervoxel consists of a fixed number of voxels, and the total number of supervoxels is $N_s = \lceil \frac{R}{R_s} \rceil \times \lceil \frac{A}{A_s} \rceil \times \lceil \frac{H}{H_s} \rceil$. In each distillation step, only $K$ supervoxels are sampled for affinity distillation.

Difficulty-aware sampling: this sampling strategy makes supervoxels containing less frequent classes and distant objects more likely to be sampled. It first computes a weight for each supervoxel, then normalizes the weights, and finally samples supervoxels according to the resulting probabilities. The relevant formulas are:
$$\begin{aligned} W_i &= \frac{1}{f_{class}} \times \frac{d_i}{R} \times \frac{1}{N_s} \\ f_{class} &= 4 \exp(-2N_{minor}) + 1 \\ P_i &= \frac{W_i}{\sum_{i=1}^{N_s}W_i} \end{aligned}$$
Here, $W_i$ is the weight of the $i$-th supervoxel, $f_{class}$ is the class frequency term, $P_i$ is the probability that the $i$-th supervoxel is sampled, $d_i$ is the distance from the outer arc of the $i$-th supervoxel to the origin of the XOY plane, and $N_{minor}$ is the number of minority voxels in the supervoxel.
The paper treats classes that account for more than 1% of the points in the whole dataset as majority classes, and the rest as minority classes. A minority voxel is one whose class label (determined by the majority encoding strategy) is a minority class. When a supervoxel contains no minority voxels, $f_{class}=5$; as the number of minority voxels grows, $f_{class}$ decreases rapidly towards its minimum of 1, so such supervoxels receive larger weights.

Feature handling: for point clouds, the number and density of input points vary, so the numbers of point features and voxel features per supervoxel also vary. When computing the loss, however, the number of features is usually kept fixed. To solve this, if a supervoxel has more than $N_p$ point features, redundant point features (whose labels belong to majority classes) are randomly removed; if it has fewer than $N_p$, all-zero point features are padded. A similar approach is used for the voxel features, whose number is fixed to $N_v$. Both $N_p$ and $N_v$ are set manually.
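A numpy sketch of the two mechanisms just described, under stated simplifications: the sampling function implements the $W_i$, $f_{class}$, $P_i$ formulas directly, while the feature-count helper drops rows uniformly at random rather than restricting removal to majority-class features as the paper does:

```python
import numpy as np

def supervoxel_sampling_probs(n_minor, d, r_max):
    """Difficulty-aware sampling probabilities P_i from the formulas above.
    n_minor: (Ns,) minority-voxel counts; d: (Ns,) outer-arc distances."""
    ns = len(n_minor)
    f_class = 4.0 * np.exp(-2.0 * n_minor) + 1.0    # 5 with no minority voxels, -> 1
    w = (1.0 / f_class) * (d / r_max) * (1.0 / ns)  # unnormalised weights W_i
    return w / w.sum()                              # probabilities P_i

def fix_feature_count(features, n_fixed, rng=None):
    """Keep a fixed number of feature rows: randomly drop extras, or zero-pad.
    (Dropping uniformly at random is a simplification; the paper removes
    only majority-class features.)"""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, c = features.shape
    if n > n_fixed:
        return features[rng.choice(n, size=n_fixed, replace=False)]
    return np.vstack([features, np.zeros((n_fixed - n, c))])
```

Note how both more minority voxels and a larger outer-arc distance raise a supervoxel's sampling probability, which is exactly the "difficult objects" bias the strategy is after.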

After the above processing, the $r$-th sampled supervoxel has $N_p$ point features $\hat{F}_r^p \in R^{N_p \times C_f}$ and $N_v$ voxel features $\hat{F}_r^v \in R^{N_v \times C_f}$. For each supervoxel, the paper then computes the inter-point affinity matrix:
$$C^p(i, j, r) = \frac{\hat{F}_r^p(i)^T \hat{F}_r^p(j)}{\parallel \hat{F}_r^p(i) \parallel_2 \parallel \hat{F}_r^p(j) \parallel_2}, \quad r \in \{1, \cdots, K\}$$

The affinity score captures the similarity of each pair of point features, and can be regarded as high-level structural knowledge for the student network to learn. The inter-point affinity distillation loss is computed as:
$$L_{aff}^p (C_S^p, C_T^p) = \frac{1}{KN_p^2}\sum_{r=1}^K \sum_{i=1}^{N_p} \sum_{j=1}^{N_p}\parallel C_S^p(i, j, r) - C_T^p(i, j, r) \parallel_2^2$$

The inter-voxel affinity matrix is computed analogously to the inter-point one, and its distillation loss is:
$$L_{aff}^v (C_S^v, C_T^v) = \frac{1}{KN_v^2}\sum_{r=1}^K \sum_{i=1}^{N_v} \sum_{j=1}^{N_v}\parallel C_S^v(i, j, r) - C_T^v(i, j, r) \parallel_2^2$$
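The affinity matrix is just the pairwise cosine similarity of the feature rows, and the distillation loss the mean squared difference of the two matrices. A numpy sketch for a single supervoxel (averaging over the $K$ sampled supervoxels would give the $1/(KN^2)$ normalisation above):

```python
import numpy as np

def affinity_matrix(f):
    """Pairwise cosine similarity of feature rows: C(i, j, r) above
    for a single supervoxel r, with features f of shape (N, Cf)."""
    f_hat = f / np.maximum(np.linalg.norm(f, axis=1, keepdims=True), 1e-12)
    return f_hat @ f_hat.T

def affinity_distillation_loss(feats_s, feats_t):
    """Squared student-teacher affinity difference, averaged over the
    N^2 matrix entries of one supervoxel."""
    return np.mean((affinity_matrix(feats_s) - affinity_matrix(feats_t)) ** 2)
```

Because the features are unit-normalised first, the diagonal of the affinity matrix is always 1 and the loss is invariant to the scale of individual features, which is what makes it a purely structural signal.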
The total loss function
The total loss of the network consists of 7 parts: the pointwise and voxelwise weighted cross-entropy losses (terms 1 and 2), the lovasz-softmax loss (term 3), the point-to-voxel output distillation losses (terms 4 and 5), and the point-to-voxel affinity distillation losses (terms 6 and 7):
$$\begin{aligned} L = &L_{wce}^p + L_{wce}^v + L_{lovasz}\\ &+\alpha_1 L_{out}^p(O_S^p, O_T^p) + \alpha_2 L_{out}^v(O_S^v, O_T^v) \\ &+\beta_1 L_{aff}^p(C_S^p, C_T^p) + \beta_2 L_{aff}^v (C_S^v, C_T^v) \end{aligned}$$
Here, $\alpha_1$, $\alpha_2$, $\beta_1$, and $\beta_2$ balance the impact of the distillation losses against the main-task losses.
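Assembling the objective is then a straightforward weighted sum of the seven scalar terms; a trivial sketch (the values of the $\alpha$/$\beta$ weights are tuned per backbone and not given here):

```python
def pvd_total_loss(l_wce_p, l_wce_v, l_lovasz,
                   l_out_p, l_out_v, l_aff_p, l_aff_v,
                   alpha1, alpha2, beta1, beta2):
    """Seven-term PVD objective: two weighted cross-entropy terms, the
    lovasz-softmax term, and the four weighted distillation terms."""
    return (l_wce_p + l_wce_v + l_lovasz
            + alpha1 * l_out_p + alpha2 * l_out_v
            + beta1 * l_aff_p + beta2 * l_aff_v)
```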
Paper: https://arxiv.org/pdf/2206.02099.pdf
Code: https://github.com/cardwing/Codes-for-PVKD