[Notes] PV-RCNN

A CVPR 2020 paper. In the Waymo Open Challenge point-cloud competition for autonomous driving it took three runner-up places on the open leaderboard (all sensors and algorithms allowed) and three first places among LiDAR-only methods, and it held the top spot on the KITTI benchmark for more than half a year. Personally, I feel this paper achieves a very deep combination of the two schools of point-cloud processing, the PointNet line and the VoxelNet line. The method section is written very clearly, the figures are superb, and it reads smoothly.

1. Introduction

Grid-based methods: transform the irregular point clouds to regular representations such as 3D voxels or 2D bird-view maps, which could be efficiently processed by 3D or 2D Convolutional Neural Networks (CNN). More computationally efficient but the inevitable information loss degrades the fine-grained localization accuracy.

Point-based methods: directly extract discriminative features from raw point clouds. Higher computation cost, but they (1) preserve accurate location information and (2) have flexible receptive fields.

3. PV-RCNN for Point Cloud Object Detection

PV-RCNN consists of a 3D voxel CNN with sparse convolution as the backbone for efficient feature encoding and proposal generation.
Given each 3D object proposal, to effectively pool its corresponding features from the scene, we propose two novel operations: the voxel-to-keypoint scene encoding, which summarizes all the voxels of the overall scene feature volumes into a small number of feature keypoints, and the point-to-grid RoI feature abstraction, which effectively aggregates the scene keypoint features to RoI grids for proposal confidence prediction and location refinement.

My own summary: point clouds are mainly processed either the PointNet/PointNet++ way or the VoxelNet way. PointNet++ works on the raw point cloud directly, preserving the original information, and ball query makes the receptive field very flexible; but such set abstraction is usually time-consuming, and personally I find it not structural enough. The VoxelNet family voxelizes directly and just runs 3D + 2D CNN inference, which is relatively compute-efficient, but resolution drops, and predicting after downsampling makes it lower still; even with upsampling, the receptive field is not as flexible as PointNet++'s.

We already know the two-stage idea: first predict proposals, then, with these regions of interest in hand, design more specialized feature reasoning per proposal to refine them. Among earlier works, PointRCNN's first stage extracts features the PointNet way and predicts proposals, and its second stage digs into the foreground points, runs PointNet inference again, and concatenates their big first-stage features; this established a very fundamental paradigm. Fast Point R-CNN instead used a VoxelNet-style first stage to extract features and predict proposals, with a second stage likewise based on concatenation, plus an attention module for weighting. But as you can imagine, this second-stage way of having foreground points directly index voxel features is too rigid, still too naive.

So PV-RCNN arrives, saying: watch how I do it. My first stage still uses a VoxelNet-style backbone to extract features and predict proposals. In the second stage, when rebuilding the local features of the regions of interest / foreground points / proposals, I neither aggregate neighboring points' features like PointRCNN, nor directly index the corresponding voxel features like Fast Point R-CNN. Instead: 1. Voxel Set Abstraction (VSA): use keypoints, but aggregate the features of their "neighbor voxels". In other words, I still pick keypoints from the points in the spirit of PointNet++ set abstraction, but what gets aggregated are voxel features. You can see that centering on a real keypoint preserves granularity while also reaping the benefits of the VoxelNet-inferred features; note that each aggregated voxel feature has its voxel coordinate's position relative to the keypoint concatenated in.
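To make the keypoint-selection step concrete, here is a minimal NumPy sketch of farthest point sampling, which is how PV-RCNN picks its keypoints (n = 2,048 on KITTI per the paper); the function name and array layout are my own assumptions, not the paper's code.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_keypoints: int) -> np.ndarray:
    """Greedy FPS: repeatedly pick the point farthest from those chosen.

    points: (N, 3) raw point-cloud coordinates.
    Returns the indices of the n_keypoints selected keypoints.
    """
    selected = np.zeros(n_keypoints, dtype=np.int64)
    # Squared distance from every point to its nearest selected keypoint.
    min_sq_dist = np.full(points.shape[0], np.inf)
    selected[0] = 0  # start from an arbitrary point
    for i in range(1, n_keypoints):
        diff = points - points[selected[i - 1]]
        min_sq_dist = np.minimum(min_sq_dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = int(np.argmax(min_sq_dist))
    return selected
```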

Concretely, this is performed on the four scales of feature maps from the VoxelNet backbone: for each of the frame's precomputed keypoints, voxel features are aggregated around the corresponding location on each scale's feature map, and the per-scale results are concatenated.
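A PyTorch-flavoured sketch of one VSA scale, with a naive ball query for readability (the official OpenPCDet implementation uses batched CUDA kernels); all names and shapes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

def voxel_set_abstraction(keypoints, voxel_centers, voxel_feats,
                          radius, mlp, c_out):
    """Aggregate sparse-voxel features around each keypoint at one scale.

    keypoints:     (n, 3) keypoint xyz, sampled by FPS on the raw points.
    voxel_centers: (m, 3) centers of the non-empty voxels at this scale.
    voxel_feats:   (m, c) their sparse-CNN features.
    radius:        ball-query radius for this scale.
    mlp:           shared point-wise network mapping (c + 3) -> c_out.
    Returns (n, c_out) aggregated keypoint features for this scale.
    """
    out = []
    for kp in keypoints:
        dist = torch.norm(voxel_centers - kp, dim=1)
        idx = torch.nonzero(dist < radius).squeeze(1)
        if idx.numel() == 0:  # no non-empty voxel within range
            out.append(voxel_feats.new_zeros(c_out))
            continue
        # The voxel position relative to the keypoint is concatenated in,
        # keeping the pooled feature localization-aware.
        grouped = torch.cat([voxel_feats[idx], voxel_centers[idx] - kp], dim=1)
        out.append(mlp(grouped).max(dim=0).values)  # PointNet-style max pool
    return torch.stack(out)
```

In the paper this runs on all four sparse-conv stages (1×, 2×, 4×, 8×) with per-scale radii, and the four outputs are concatenated for each keypoint.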

Besides that, at this step: 2. Extended VSA: use the keypoint's exact position to index into the 2D BEV feature map, obtaining another feature by interpolation, plus the keypoint's raw point-cloud feature. These three sources of features together represent each keypoint my second stage obtained by FPS. Next, 3. Predicted Keypoint Weighting (PKW). Clearly, FPS is still uniform sampling with no notion of importance whatsoever, so, as in PointRCNN, the result of a point-cloud segmentation is used to weight the keypoints, as sketched below.
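A minimal PyTorch sketch of the PKW idea, with my own module and layer sizes: a small MLP predicts a per-keypoint foreground probability, and the concatenated keypoint feature is simply rescaled by it.

```python
import torch
import torch.nn as nn

class PredictedKeypointWeighting(nn.Module):
    """PKW sketch: down-weight keypoints that look like background.

    A small MLP predicts a foreground probability per keypoint and the
    keypoint feature is rescaled by it. In the paper this branch is
    trained with focal loss against the box-derived labels below.
    """
    def __init__(self, c_in: int):
        super().__init__()
        self.fg_mlp = nn.Sequential(
            nn.Linear(c_in, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, keypoint_feats: torch.Tensor) -> torch.Tensor:
        # keypoint_feats: (n, c_in) = concat of multi-scale VSA features,
        # the interpolated BEV feature, and the raw-point SA feature.
        fg_prob = self.fg_mlp(keypoint_feats)  # (n, 1)
        return keypoint_feats * fg_prob        # importance-weighted features
```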

Notably, the labels for training the segmentation are obtained directly from the points enclosed by the 3D box labels; the authors note:

The segmentation labels can be directly generated by the 3D detection box annotations, i.e. by checking whether each key point is inside or outside of a ground-truth 3D box since the 3D objects in autonomous driving scenes are naturally separated in 3D space.
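A NumPy sketch of generating these free labels, assuming boxes are given as (center, size, yaw); the helper and conventions are mine, not the paper's.

```python
import numpy as np

def points_in_box_labels(points: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """Free segmentation labels: 1 iff a point lies inside any GT box.

    points: (N, 3) xyz; boxes: (B, 7) as (cx, cy, cz, dx, dy, dz, yaw),
    assuming cz is the box center (if labels put z at the box bottom, as
    KITTI does, shift cz up by dz / 2 first).
    """
    labels = np.zeros(len(points), dtype=np.int64)
    for cx, cy, cz, dx, dy, dz, yaw in boxes:
        # Rotate points into the box frame, then do an axis-aligned test.
        local = points - np.array([cx, cy, cz])
        c, s = np.cos(-yaw), np.sin(-yaw)
        x = local[:, 0] * c - local[:, 1] * s
        y = local[:, 0] * s + local[:, 1] * c
        inside = (np.abs(x) < dx / 2) & (np.abs(y) < dy / 2) \
                 & (np.abs(local[:, 2]) < dz / 2)
        labels[inside] = 1
    return labels
```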

This point is also stressed repeatedly in the authors' earlier work Part A^2, which treats it as information that comes essentially free of charge in our 3D application context and should be exploited well. I haven't read and summarized Part A^2 carefully yet, but its main point is attending to each point's relative position inside the box: otherwise, with direct PointNet-style max pooling, different boxes can yield identical results, producing ambiguity.

Now that we have the keypoints' aggregated local features, weighted by their segmentation-derived foreground importance, we can formally build the features for each proposal. 4. RoI-grid Pooling via Set Abstraction. The thinking here partly follows Part A^2: the intra-proposal positions of points matter. So the proposal is divided into internal voxels to obtain grid points, and once again these grid points serve as centers to aggregate the prepared keypoint features. "Once again" because this is obviously our second use of set abstraction, which makes the step easy to understand. In the end we obtain a feature representing each proposal, ready to feed into the heads.
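A NumPy sketch of the grid-point generation for a single proposal (the paper uses a 6 × 6 × 6 grid); each returned point would then run the same ball-query set abstraction as before, this time over the weighted keypoints.

```python
import numpy as np

def roi_grid_points(box, grid_size: int = 6) -> np.ndarray:
    """Uniform grid points inside one proposal (PV-RCNN uses 6x6x6).

    box: (cx, cy, cz, dx, dy, dz, yaw). Returns (grid_size**3, 3) points
    in world coordinates; each later aggregates nearby keypoint features
    via another round of set abstraction.
    """
    cx, cy, cz, dx, dy, dz, yaw = box
    # Cell-center offsets in the box-local frame, in (-0.5, 0.5).
    t = (np.arange(grid_size) + 0.5) / grid_size - 0.5
    gx, gy, gz = np.meshgrid(t * dx, t * dy, t * dz, indexing="ij")
    local = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    # Rotate into the world frame and translate to the box center,
    # so the grid follows the proposal's orientation.
    c, s = np.cos(yaw), np.sin(yaw)
    world = local.copy()
    world[:, 0] = local[:, 0] * c - local[:, 1] * s + cx
    world[:, 1] = local[:, 0] * s + local[:, 1] * c + cy
    world[:, 2] = local[:, 2] + cz
    return world
```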

It's easy to see the pattern: first prepare the aggregated local features of keypoints that are important from the segmentation point of view, then aggregate them inside the actual proposals to complete the proposals' final representation. This combines the two sources of regions of interest, foreground points and proposals, rather than directly taking the segmented foreground points or the points inside a proposal; it really is a deep combination.

Also worth noting: the target for the confidence head is built from IoU. In other words, I want this head to predict how likely this proposal really is an object, or how much it overlaps a true object. This is an old trick dating back to the objectness of Faster R-CNN and YOLOv1; I just hadn't seen or used it in a while and had half-forgotten it, so consider this a reminder.
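Concretely, the paper defines the quality-aware confidence target as y = min(1, max(0, 2·IoU − 0.5)), trained with binary cross-entropy; as a one-liner:

```python
def confidence_target(iou_with_gt: float) -> float:
    """PV-RCNN's quality-aware confidence target for one proposal.

    3D IoU with the matched ground-truth box is mapped linearly to [0, 1]:
    IoU <= 0.25 gives 0, IoU >= 0.75 gives 1.
    """
    return min(1.0, max(0.0, 2.0 * iou_with_gt - 0.5))
```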

The above is the method as I straightened it out in my head after reading. Below I'll simply paste screenshots of the relevant parts of the paper, which are written well enough.

3.1. 3D Voxel CNN for Efficient Feature Encoding and Proposal Generation

We adopt it as the backbone of our framework for feature encoding and 3D proposal generation. The input points P are first divided into small voxels with spatial resolution of L × W × H, where the features of the non-empty voxels are directly calculated as the mean of point-wise features of all inside points. The commonly used features are the 3D coordinates and reflectance intensities. [...] By converting the encoded 8× downsampled 3D feature volumes into 2D bird-view feature maps, high-quality 3D proposals are generated following the anchor-based approaches. Specifically, we stack the 3D feature volume along the Z axis to obtain the L/8 × W/8 bird-view feature maps [...]
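A minimal NumPy sketch of this input voxelization, mean-pooling the (x, y, z, intensity) features of the points inside each non-empty voxel; the layout and names are assumptions for illustration.

```python
import numpy as np

def voxelize_mean(points: np.ndarray, voxel_size, pc_range) -> dict:
    """Mean-pool point features into voxels for the sparse 3D CNN input.

    points:     (N, 4) xyz + reflectance, assumed pre-cropped to pc_range.
    voxel_size: (vx, vy, vz) in meters.
    pc_range:   (x_min, y_min, z_min), lower corner of the valid range.
    Returns {(ix, iy, iz): mean feature (4,)} for the non-empty voxels.
    """
    idx = np.floor((points[:, :3] - np.asarray(pc_range)) /
                   np.asarray(voxel_size)).astype(np.int64)
    voxels = {}
    for key, feat in zip(map(tuple, idx), points):
        acc = voxels.setdefault(key, [np.zeros(points.shape[1]), 0])
        acc[0] += feat
        acc[1] += 1
    return {k: total / n for k, (total, n) in voxels.items()}
```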

3.2. Voxel-to-keypoint Scene Encoding via Voxel Set Abstraction

Our proposed framework first aggregates the voxels at the multiple neural layers representing the entire scene into a small number of keypoints, which serve as a bridge between the 3D voxel CNN feature encoder and the proposal refinement network.

Note: it really does act like a bridge, since voxel map features -> keypoint features -> per-proposal features.

[Screenshots from the paper.]

3.3. Keypoint-to-grid RoI Feature Abstraction for Proposal Refinement

Author: traviscui

Source: https://www.cnblogs.com/traviscui/p/16600731.html

License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
