[Paper] PointPillars

1. Introduction

There are two key differences between lidar point clouds and images: 1) the point cloud is a sparse representation while an image is dense, and 2) the point cloud is 3D while the image is 2D.
Some early works focus on either 3D convolutions or a projection of the point cloud into the image plane; recent methods tend to view the lidar point cloud from a bird's eye view.
However, the bird's eye view tends to be extremely sparse, which makes direct application of convolutional neural networks impractical and inefficient.

  • A common workaround: partition the ground plane into a regular grid
  • VoxelNet removes hand-crafted features by learning them end-to-end, but it is limited in speed (its 3D convolutions make inference slow, around 4.4 Hz)
  • In this work we propose PointPillars: a method for object detection in 3D that enables end-to-end learning with only 2D convolutional layers. PointPillars uses a novel encoder that learns features on pillars (vertical columns) of the point cloud to predict 3D oriented boxes for objects.

1.1.1 Object detection using CNNs

In this work, we use a single-stage method.

1.1.2 Object detection in lidar point clouds

Object detection in point clouds is an intrinsically 3D problem, so a 3D CNN is natural but slow.
In the most common paradigm, the point cloud is organized in voxels, and the set of voxels in each vertical column is encoded into a fixed-length, hand-crafted feature encoding to form a pseudo-image which can be processed by a standard image detection architecture.
MV3D and AVOD are two-stage methods; PIXOR and Complex-YOLO are one-stage methods.
PointNet and VoxelNet learn features end-to-end, but VoxelNet is still slow; Frustum PointNet relies on 2D image detections and is therefore not end-to-end on lidar alone; SECOND improves on VoxelNet but still has 3D convolutional layers.

2. PointPillars Network

[Figure 2: Network overview of PointPillars]

PointPillars accepts point clouds as input and estimates oriented 3D boxes for cars, pedestrians and cyclists. It consists of three main stages (Figure 2):

  • (1) A feature encoder network that converts a point cloud to a sparse pseudo-image;
  • (2) a 2D convolutional backbone to process the pseudo-image into high-level representation;
  • (3) a detection head that detects and regresses 3D boxes.

2.1. Pointcloud to Pseudo-Image

  • To apply a 2D convolutional architecture, we first convert the point cloud to a pseudo-image.
  • We denote by l a point in the point cloud with coordinates x, y, z and reflectance r. As a first step, the point cloud is discretized into an evenly spaced grid in the x-y plane, creating a set of pillars P.
  • The points in each pillar are then augmented with x_c, y_c, z_c and x_p, y_p (subscript c: distance to the arithmetic mean of all points in the pillar; subscript p: offset from the pillar x, y center). Each point is now a 9-dimensional vector (x, y, z, r, x_c, y_c, z_c, x_p, y_p).
  • Due to the sparsity, most pillars will be empty, and non-empty pillars will in general have few points in them. We impose limits on the number of non-empty pillars per sample (the whole point cloud), P, and the number of points per pillar, N (random sampling if a pillar has too many points, zero padding if it has too few). This yields a dense tensor of size (D, P, N).
  • Note: to summarize, (D, P, N) means that for the whole sample (all lidar points in this frame) we keep at most P non-empty pillars, each pillar holds up to N points stacked vertically, and each point carries a D = 9 dimensional feature vector. (P is the number of retained non-empty pillars, not H*W; the scatter step below maps these pillars back onto the H x W canvas.)
  • Then each point is fed through a linear layer followed by BatchNorm and ReLU, producing a tensor of size (C, P, N) (the per-point feature goes from D to C channels).
  • Then, for each pillar, a max operation is applied over its N points (channel-wise), giving a tensor of size (C, P); this is like a simplified PointNet operating on the points of each pillar. (So the "Learned Features" graphic is a vertical view of the "Stacked Pillars".)
  • Finally, the features are scattered back to the original pillar locations to create a pseudo-image of size (C, H, W), where H and W indicate the height and width of the canvas (a minimal code sketch of this encoder and scatter step follows this list).
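To make the tensor shapes concrete, below is a minimal PyTorch-style sketch of the pillar encoder and the scatter step. It assumes a batched (B, D, P, N) input; the names `PillarFeatureNet`, `scatter_to_pseudo_image`, and `pillar_coords` are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Simplified pillar encoder: per-point Linear + BatchNorm + ReLU, then max over points.

    Input : (B, D, P, N) decorated points (D = 9, P = max pillars, N = max points per pillar)
    Output: (B, C, P)    one C-dimensional feature per pillar
    """
    def __init__(self, in_channels=9, out_channels=64):
        super().__init__()
        # A 1x1 convolution over the (P, N) grid is equivalent to a shared per-point linear layer.
        self.linear = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x):                       # x: (B, D, P, N)
        x = self.relu(self.bn(self.linear(x)))  # (B, C, P, N)
        return x.max(dim=3).values              # channel-wise max over the N points -> (B, C, P)


def scatter_to_pseudo_image(pillar_features, pillar_coords, H, W):
    """Scatter per-pillar features back to their grid cells to form the pseudo-image.

    pillar_features: (B, C, P) float tensor
    pillar_coords  : (B, P, 2) integer (row, col) index of each pillar on the canvas
    returns        : (B, C, H, W) pseudo-image; cells with no pillar stay zero
    """
    B, C, P = pillar_features.shape
    canvas = pillar_features.new_zeros(B, C, H, W)
    for b in range(B):
        rows, cols = pillar_coords[b, :, 0], pillar_coords[b, :, 1]
        canvas[b, :, rows, cols] = pillar_features[b]
    return canvas
```

With the settings from Section 4.2 (P = 12000, N = 100) and C = 64 from Section 3.1, the input is (B, 9, 12000, 100) and the output pseudo-image is (B, 64, H, W).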

2.2. Backbone

The backbone has two sub-networks: one top-down network that produces features at increasingly small spatial resolution and a second network that performs upsampling. The final output features are a concatenation of all features that originated from different strides.

2.3. Detection Head

The detection head is the Single Shot Detector (SSD) setup: prior boxes (anchors) are matched to ground truth using 2D IoU.
Bounding box height and elevation are not used for matching; instead, given a 2D match, the height and elevation become additional regression targets.

3. Implementation Details

3.1. Network

Instead of pre-training our networks, all weights were initialized randomly using a uniform distribution as in [8].
The encoder network has C = 64 output features. The car and pedestrian/cyclist backbones are the same except for the stride of the first block (S = 2 for car, S = 1 for pedestrian/cyclist). Both networks consist of three blocks, Block1(S, 4, C), Block2(2S, 6, 2C), and Block3(4S, 6, 4C), where Block(S, L, F) denotes L 3x3 conv layers with F output channels operating at stride S relative to the pseudo-image. Each block is upsampled by a corresponding upsampling step, Up1(S, S, 2C), Up2(2S, S, 2C), and Up3(4S, S, 2C), where Up(S_in, S_out, F) upsamples from stride S_in to stride S_out with F output channels. The features of Up1, Up2, and Up3 are then concatenated to create 6C features for the detection head.
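Reading the blocks as above, a rough PyTorch-style sketch of the car backbone (C = 64, S = 2) could look like the following; the helper names and exact layer ordering are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, num_layers, stride):
    """Block(S, L, F): the first 3x3 conv downsamples by `stride`, followed by L-1 stride-1 convs."""
    layers = [nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
              nn.BatchNorm2d(out_ch), nn.ReLU()]
    for _ in range(num_layers - 1):
        layers += [nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
                   nn.BatchNorm2d(out_ch), nn.ReLU()]
    return nn.Sequential(*layers)

def up_block(in_ch, out_ch, factor):
    """Up(S_in, S_out, F): transposed conv that upsamples by S_in / S_out."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=factor, stride=factor, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU())

class Backbone(nn.Module):
    def __init__(self, C=64, S=2):                    # S = 2 for cars, S = 1 for pedestrians/cyclists
        super().__init__()
        self.block1 = conv_block(C,     C,     4, S)  # Block1(S,  4, C)
        self.block2 = conv_block(C,     2 * C, 6, 2)  # Block2(2S, 6, 2C)
        self.block3 = conv_block(2 * C, 4 * C, 6, 2)  # Block3(4S, 6, 4C)
        self.up1 = up_block(C,     2 * C, 1)          # Up1(S,  S, 2C)
        self.up2 = up_block(2 * C, 2 * C, 2)          # Up2(2S, S, 2C)
        self.up3 = up_block(4 * C, 2 * C, 4)          # Up3(4S, S, 2C)

    def forward(self, x):                             # x: (B, C, H, W) pseudo-image
        x1 = self.block1(x)                           # stride S
        x2 = self.block2(x1)                          # stride 2S
        x3 = self.block3(x2)                          # stride 4S
        # Upsample everything back to stride S and concatenate -> 6C channels for the detection head.
        return torch.cat([self.up1(x1), self.up2(x2), self.up3(x3)], dim=1)
```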

3.2. Loss

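For reference, the loss used in the paper (inherited from SECOND) can be summarized as follows, where gt and a denote ground truth and anchor, d^a = sqrt((w^a)^2 + (l^a)^2), the focal-loss parameters are alpha = 0.25 and gamma = 2, the weights are beta_loc = 2, beta_cls = 1, beta_dir = 0.2, and L_dir is a softmax classification loss over discretized heading that resolves the ambiguity of the sine-encoded angle.

```latex
\Delta x = \frac{x^{gt} - x^{a}}{d^{a}}, \quad
\Delta y = \frac{y^{gt} - y^{a}}{d^{a}}, \quad
\Delta z = \frac{z^{gt} - z^{a}}{h^{a}}, \quad
\Delta \theta = \sin\left(\theta^{gt} - \theta^{a}\right)

\Delta w = \log \frac{w^{gt}}{w^{a}}, \quad
\Delta l = \log \frac{l^{gt}}{l^{a}}, \quad
\Delta h = \log \frac{h^{gt}}{h^{a}}

\mathcal{L}_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \operatorname{SmoothL1}\!\left(\Delta b\right), \qquad
\mathcal{L}_{cls} = -\alpha_{a} \left(1 - p^{a}\right)^{\gamma} \log p^{a}

\mathcal{L} = \frac{1}{N_{pos}}\left( \beta_{loc}\,\mathcal{L}_{loc} + \beta_{cls}\,\mathcal{L}_{cls} + \beta_{dir}\,\mathcal{L}_{dir} \right)
```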

4. Experimental setup

4.1. Dataset

All experiments use the KITTI object detection benchmark dataset, which is divided into 7481 training and 7518 testing samples.
For experimental studies we split the official training set into 3712 training samples and 3769 validation samples. We follow the standard convention [2, 31] of only using lidar points that project into the image. Following the standard literature practice on KITTI [11, 31, 28], we train one network for cars and one network for both pedestrians and cyclists.

4.2. Settings

Voxel map settings

  • x-y (pillar) resolution: 0.16 m; max number of pillars (P): 12000; max number of points per pillar (N): 100

About anchors

  • We use the same anchors and matching strategy as VoxelNet. Each class anchor is described by a width, length, height, and z center, and is applied at two orientations: 0 and 90 degrees. Anchors are matched to ground truth using 2D IoU with the following rules: a positive match is either the highest IoU with a ground truth box or above the positive match threshold, while a negative match is below the negative threshold; all other anchors are ignored (neutral) in the loss (a minimal matching sketch follows this list).
  • Car. The x, y, z range is [(0, 70.4), (-40, 40), (-3, 1)] meters, respectively. The car anchor has width, length, and height of (1.6, 3.9, 1.5) m with a z center of -1 m. Matching uses positive and negative thresholds of 0.6 and 0.45.
  • Pedestrian & Cyclist. The x, y, z range is [(0, 48), (-20, 20), (-2.5, 0.5)] meters, respectively. The pedestrian anchor has width, length, and height of (0.6, 0.8, 1.73) meters with a z center of -0.6 meters, while the cyclist anchor has width, length, and height of (0.6, 1.76, 1.73) meters with a z center of -0.6 meters. Matching uses positive and negative thresholds of 0.5 and 0.35.
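As a rough illustration of the matching rule (not the authors' code), the sketch below labels each anchor positive, negative, or ignored from its best overlap with the ground truth. For simplicity it uses axis-aligned 2D IoU on (x1, y1, x2, y2) BEV boxes, whereas the real anchors also come in a 90-degree-rotated variant; the thresholds default to the car values (0.6 / 0.45).

```python
import numpy as np

def iou_2d(boxes_a, boxes_b):
    """Axis-aligned 2D IoU between (N, 4) and (M, 4) boxes given as (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match_anchors(anchors, gt_boxes, pos_thr=0.6, neg_thr=0.45):
    """Label anchors: 1 = positive, 0 = negative, -1 = ignored (neutral in the loss)."""
    if len(gt_boxes) == 0:
        return np.zeros(len(anchors), dtype=np.int64)   # no objects: all anchors are negative
    iou = iou_2d(anchors, gt_boxes)                     # (num_anchors, num_gt)
    labels = -np.ones(len(anchors), dtype=np.int64)     # start as ignored
    max_iou = iou.max(axis=1)
    labels[max_iou < neg_thr] = 0                       # clearly below the negative threshold
    labels[max_iou >= pos_thr] = 1                      # clearly above the positive threshold
    labels[iou.argmax(axis=0)] = 1                      # best anchor for each ground truth box
    return labels
```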

Post-processing setting

  • At inference time we apply axis-aligned non-maximum suppression (NMS) with an overlap threshold of 0.5 IoU.
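A minimal sketch of greedy axis-aligned NMS at the 0.5 IoU threshold, again on (x1, y1, x2, y2) BEV boxes; the function and variable names are illustrative.

```python
import numpy as np

def nms_bev(boxes, scores, iou_thr=0.5):
    """Greedy axis-aligned NMS: repeatedly keep the highest-scoring box and drop overlaps."""
    order = np.argsort(scores)[::-1]        # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU of the kept box against all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]        # keep only boxes that do not overlap too much
    return keep
```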

4.3. Data Augmentation

First, following SECOND, for each sample we randomly select 15, 0, and 8 ground truth samples for cars, pedestrians, and cyclists respectively and place them into the current point cloud (ground-truth database sampling). Next, all ground truth boxes are individually augmented (each box is rotated and translated).
Finally, we perform two sets of global augmentations that are jointly applied to the point cloud and all boxes: a random mirroring flip along the x axis, a global rotation and global scaling, and a global translation.
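A minimal sketch of the global augmentations applied jointly to the points and boxes. The flip is implemented as mirroring about the x axis (negating y and the heading), and the sampling ranges for rotation and scaling are assumed SECOND-style defaults rather than values given in these notes; the N(0, 0.2) translation noise follows the paper.

```python
import numpy as np

def global_augment(points, boxes, rng=np.random):
    """points: (N, 4) array of [x, y, z, r]; boxes: (M, 7) array of [x, y, z, w, l, h, theta]."""
    # Random mirroring flip about the x axis: negate y and the heading angle.
    if rng.rand() < 0.5:
        points[:, 1] *= -1
        boxes[:, 1] *= -1
        boxes[:, 6] *= -1
    # Global rotation about the z axis (range is an assumed SECOND-style default).
    angle = rng.uniform(-np.pi / 4, np.pi / 4)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    points[:, :2] = points[:, :2] @ rot.T
    boxes[:, :2] = boxes[:, :2] @ rot.T
    boxes[:, 6] += angle
    # Global scaling (range is an assumed SECOND-style default).
    scale = rng.uniform(0.95, 1.05)
    points[:, :3] *= scale
    boxes[:, :6] *= scale                      # scale centers and sizes together
    # Global translation drawn from N(0, 0.2) on x, y, z to simulate localization noise.
    shift = rng.normal(0.0, 0.2, size=3)
    points[:, :3] += shift
    boxes[:, :3] += shift
    return points, boxes
```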

5. Results

Quantitative Analysis

  • Official KITTI evaluation detection metrics are: bird’s eye view (BEV), 3D, 2D, and average orientation similarity (AOS). The 2D detection is done in the image plane. Average orientation similarity assesses the average orientation (measured in BEV) similarity for 2D detections.
  • The KITTI dataset is stratified into easy, moderate, and hard difficulties; methods are ranked by performance on the moderate setting.

Qualitative Analysis

  • ...

6. Realtime Inference

As indicated by our results (Table 1 and Figure 5), PointPillars represents a significant improvement in inference runtime. All runtimes are measured on a desktop with an Intel i7 CPU and a 1080 Ti GPU.
The main inference steps are as follows. First, the point cloud is loaded and filtered based on range and visibility in the images (1.4 ms). Then, the points are organized in pillars and decorated (2.7 ms). Next, the pillar tensor is uploaded to the GPU (2.9 ms), encoded (1.3 ms), scattered to the pseudo-image (0.1 ms), and processed by the backbone and detection heads (7.7 ms). Finally, NMS is applied on the CPU (0.1 ms), for a total runtime of 16.2 ms.

  • Encoding: the key design enabling this runtime is the PointPillars encoding. For example, at 1.3 ms it is 2 orders of magnitude faster than the VoxelNet encoder (190 ms).
  • Slimmer Design: We opt for a single PointNet in our encoder, compared to 2 sequential PointNets suggested by VoxelNet.
  • TensorRT: switching to TensorRT gave a 45.5% speedup over the PyTorch pipeline, which runs at 42.4 Hz.
  • The Need for Speed: while it could be argued that such runtime is excessive since a lidar typically operates at 20 Hz, the paper argues that on KITTI only the lidar points projecting into the front image are used, so a full 360-degree point cloud would contain far more points, and that deployed systems may run on less powerful embedded hardware than a desktop GPU.

7. Ablation Studies

7.1. Spatial Resolution

A trade-off between speed and accuracy can be achieved by varying the size of the spatial binning. Smaller pillars allow finer localization and lead to more features, while larger pillars are faster due to fewer non-empty pillars (speeding up the encoder) and a smaller pseudo-image (speeding up the CNN backbone). Note: binning here means how the x-y grid of pillars is set up.
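As a rough worked example using the settings from Section 4.2 (0.16 m pillars, car range x in (0, 70.4) m and y in (-40, 40) m), the canvas size scales as follows; the 0.32 m case is a hypothetical comparison, not a configuration reported in these notes.

```latex
\frac{70.4\ \text{m}}{0.16\ \text{m}} = 440, \qquad \frac{80\ \text{m}}{0.16\ \text{m}} = 500
\quad \Rightarrow \quad 440 \times 500 \ \text{canvas}

\text{At } 0.32\ \text{m pillars:} \quad 220 \times 250 \ \text{canvas, i.e. } 4\times \text{ fewer grid cells for the backbone}
```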

7.2. Per Box Data Augmentation

Both VoxelNet [31] and SECOND [28] recommend extensive per-box augmentation. However, in our experiments, minimal box augmentation worked better.
Our hypothesis is that the introduction of ground truth sampling mitigates the need for extensive per-box augmentation.

7.3. Point Decorations

During the lidar point decoration step, we perform the VoxelNet [31] decorations plus two additional decorations: x_p and y_p, the x and y offsets from the pillar x, y center. Note: decoration here means the extra information appended to each raw point.

7.4. Encoding

To assess the impact of the proposed PointPillars encoding in isolation, we implemented several encoders in the official codebase of SECOND [28]. Note: encoding here means converting the raw point cloud into a fixed-size format; the key contribution is encoding it into pillars.

Author: traviscui

Source: https://www.cnblogs.com/traviscui/p/16558376.html

License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
