[Paper] PointPillars

1. Introduction

There are two key differences between lidar point clouds and images: 1) the point cloud is a sparse representation while an image is dense, and 2) the point cloud is 3D while the image is 2D.
Some early works focus on either 3D convolutions or a projection of the point cloud into the image plane; recent methods tend to view the lidar point cloud from a bird's eye view.
However, the bird's eye view tends to be extremely sparse, which makes direct application of convolutional neural networks impractical and inefficient.

  • A common workaround: partition the ground plane into a regular grid
  • VoxelNet removes hand-crafted features by learning them end-to-end, but it is limited in speed (its 3D convolutions make inference slow, around 4.4 Hz)
  • In this work we propose PointPillars: a method for object detection in 3D that enables end-to-end learning with only 2D convolutional layers. PointPillars uses a novel encoder that learns features on pillars (vertical columns) of the point cloud to predict 3D oriented boxes for objects.

1.1.1 Object detection using CNNs

In this work, we use a single-stage method.

1.1.2 Object detection in lidar point clouds

Object detection in point clouds is an intrinsically 3D problem, so a 3D CNN is natural but slow.
In the most common paradigm, the point cloud is organized in voxels, and the set of voxels in each vertical column is encoded into a fixed-length, hand-crafted feature encoding to form a pseudo-image which can be processed by a standard image detection architecture.
MV3D and AVOD are two-stage methods; PIXOR and Complex-YOLO are one-stage methods.
PointNet and VoxelNet learn features end-to-end, but VoxelNet is still slow; Frustum PointNet relies on 2D image detections and is therefore not end-to-end on lidar alone; SECOND improves on VoxelNet but still has 3D convolutional layers.

2. PointPillars Network

[Figure 2: Network overview of PointPillars]

PointPillars accepts point clouds as input and estimates oriented 3D boxes for cars, pedestrians and cyclists. It consists of three main stages (Figure 2):

  • (1) A feature encoder network that converts a point cloud to a sparse pseudo-image;
  • (2) a 2D convolutional backbone to process the pseudo-image into high-level representation;
  • (3) a detection head that detects and regresses 3D boxes.

2.1. Pointcloud to Pseudo-Image

  • To apply a 2D convolutional architecture, we first convert the point cloud to a pseudo-image.
  • We denote by l a point in the point cloud with coordinates x, y, z and reflectance r. As a first step, the point cloud is discretized into an evenly spaced grid in the x-y plane, creating a set of pillars P.
  • The points in each pillar are then augmented with x_c, y_c, z_c and x_p, y_p (subscript c: distance to the arithmetic mean of all points in the pillar; subscript p: offset from the pillar x, y center). Each point is now a 9-dimensional vector (x, y, z, r, x_c, y_c, z_c, x_p, y_p).
  • Due to the sparsity, most pillars will be empty, and non-empty pillars will in general have few points in them. We impose limits on the number of non-empty pillars per sample (the whole point cloud), P, and the number of points per pillar, N (random sampling if a pillar has too many points, zero padding if it has too few). This yields a dense tensor of size (D, P, N).
  • Note: to summarize, (D, P, N) means that for the whole sample (all lidar points in this frame) we keep at most P non-empty pillars, each pillar holds up to N points stacked vertically, and each point carries a D = 9 dimensional feature vector. (P is the number of retained non-empty pillars, not H*W; the scatter step below maps these pillars back onto the H x W canvas.)
  • Then each point is fed through a linear layer followed by BatchNorm and ReLU, producing a tensor of size (C, P, N) (the per-point feature goes from D to C channels).
  • Then, for each pillar, a max operation is applied over its N points (channel-wise), giving a tensor of size (C, P); this is like a simplified PointNet operating on the points of each pillar. (So the "Learned Features" graphic is a vertical view of the "Stacked Pillars".)
  • Finally, the features are scattered back to the original pillar locations to create a pseudo-image of size (C, H, W), where H and W indicate the height and width of the canvas (a minimal code sketch of this encoder and scatter step follows this list).
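To make the tensor shapes concrete, below is a minimal PyTorch-style sketch of the pillar encoder and the scatter step. It assumes a batched (B, D, P, N) input; the names `PillarFeatureNet`, `scatter_to_pseudo_image`, and `pillar_coords` are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Simplified pillar encoder: per-point Linear + BatchNorm + ReLU, then max over points.

    Input : (B, D, P, N) decorated points (D = 9, P = max pillars, N = max points per pillar)
    Output: (B, C, P)    one C-dimensional feature per pillar
    """
    def __init__(self, in_channels=9, out_channels=64):
        super().__init__()
        # A 1x1 convolution over the (P, N) grid is equivalent to a shared per-point linear layer.
        self.linear = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x):                       # x: (B, D, P, N)
        x = self.relu(self.bn(self.linear(x)))  # (B, C, P, N)
        return x.max(dim=3).values              # channel-wise max over the N points -> (B, C, P)


def scatter_to_pseudo_image(pillar_features, pillar_coords, H, W):
    """Scatter per-pillar features back to their grid cells to form the pseudo-image.

    pillar_features: (B, C, P) float tensor
    pillar_coords  : (B, P, 2) integer (row, col) index of each pillar on the canvas
    returns        : (B, C, H, W) pseudo-image; cells with no pillar stay zero
    """
    B, C, P = pillar_features.shape
    canvas = pillar_features.new_zeros(B, C, H, W)
    for b in range(B):
        rows, cols = pillar_coords[b, :, 0], pillar_coords[b, :, 1]
        canvas[b, :, rows, cols] = pillar_features[b]
    return canvas
```

With the settings from Section 4.2 (P = 12000, N = 100) and C = 64 from Section 3.1, the input is (B, 9, 12000, 100) and the output pseudo-image is (B, 64, H, W).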

2.2. Backbone

The backbone has two sub-networks: one top-down network that produces features at increasingly small spatial resolution and a second network that performs upsampling. The final output features are a concatenation of all features that originated from different strides.

2.3. Detection Head

The detection head is the Single Shot Detector (SSD) setup: prior boxes (anchors) are matched to ground truth using 2D IoU.
Bounding box height and elevation are not used for matching; instead, given a 2D match, the height and elevation become additional regression targets.

3. Implementation Details

3.1. Network

Instead of pre-training our networks, all weights were initialized randomly using a uniform distribution as in [8].
The encoder network has C = 64 output features. The car and pedestrian/cyclist backbones are the same except for the stride of the first block (S = 2 for car, S = 1 for pedestrian/cyclist). Both networks consist of three blocks, Block1(S, 4, C), Block2(2S, 6, 2C), and Block3(4S, 6, 4C), where Block(S, L, F) denotes L 3x3 conv layers with F output channels operating at stride S relative to the pseudo-image. Each block is upsampled by a corresponding upsampling step, Up1(S, S, 2C), Up2(2S, S, 2C), and Up3(4S, S, 2C), where Up(S_in, S_out, F) upsamples from stride S_in to stride S_out with F output channels. The features of Up1, Up2, and Up3 are then concatenated to create 6C features for the detection head.
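Reading the blocks as above, a rough PyTorch-style sketch of the car backbone (C = 64, S = 2) could look like the following; the helper names and exact layer ordering are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, num_layers, stride):
    """Block(S, L, F): the first 3x3 conv downsamples by `stride`, followed by L-1 stride-1 convs."""
    layers = [nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
              nn.BatchNorm2d(out_ch), nn.ReLU()]
    for _ in range(num_layers - 1):
        layers += [nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
                   nn.BatchNorm2d(out_ch), nn.ReLU()]
    return nn.Sequential(*layers)

def up_block(in_ch, out_ch, factor):
    """Up(S_in, S_out, F): transposed conv that upsamples by S_in / S_out."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=factor, stride=factor, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU())

class Backbone(nn.Module):
    def __init__(self, C=64, S=2):                    # S = 2 for cars, S = 1 for pedestrians/cyclists
        super().__init__()
        self.block1 = conv_block(C,     C,     4, S)  # Block1(S,  4, C)
        self.block2 = conv_block(C,     2 * C, 6, 2)  # Block2(2S, 6, 2C)
        self.block3 = conv_block(2 * C, 4 * C, 6, 2)  # Block3(4S, 6, 4C)
        self.up1 = up_block(C,     2 * C, 1)          # Up1(S,  S, 2C)
        self.up2 = up_block(2 * C, 2 * C, 2)          # Up2(2S, S, 2C)
        self.up3 = up_block(4 * C, 2 * C, 4)          # Up3(4S, S, 2C)

    def forward(self, x):                             # x: (B, C, H, W) pseudo-image
        x1 = self.block1(x)                           # stride S
        x2 = self.block2(x1)                          # stride 2S
        x3 = self.block3(x2)                          # stride 4S
        # Upsample everything back to stride S and concatenate -> 6C channels for the detection head.
        return torch.cat([self.up1(x1), self.up2(x2), self.up3(x3)], dim=1)
```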

3.2. Loss

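For reference, the loss used in the paper (inherited from SECOND) can be summarized as follows, where gt and a denote ground truth and anchor, d^a = sqrt((w^a)^2 + (l^a)^2), the focal-loss parameters are alpha = 0.25 and gamma = 2, the weights are beta_loc = 2, beta_cls = 1, beta_dir = 0.2, and L_dir is a softmax classification loss over discretized heading that resolves the ambiguity of the sine-encoded angle.

```latex
\Delta x = \frac{x^{gt} - x^{a}}{d^{a}}, \quad
\Delta y = \frac{y^{gt} - y^{a}}{d^{a}}, \quad
\Delta z = \frac{z^{gt} - z^{a}}{h^{a}}, \quad
\Delta \theta = \sin\left(\theta^{gt} - \theta^{a}\right)

\Delta w = \log \frac{w^{gt}}{w^{a}}, \quad
\Delta l = \log \frac{l^{gt}}{l^{a}}, \quad
\Delta h = \log \frac{h^{gt}}{h^{a}}

\mathcal{L}_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \operatorname{SmoothL1}\!\left(\Delta b\right), \qquad
\mathcal{L}_{cls} = -\alpha_{a} \left(1 - p^{a}\right)^{\gamma} \log p^{a}

\mathcal{L} = \frac{1}{N_{pos}}\left( \beta_{loc}\,\mathcal{L}_{loc} + \beta_{cls}\,\mathcal{L}_{cls} + \beta_{dir}\,\mathcal{L}_{dir} \right)
```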

4. Experimental setup

4.1. Dataset

All experiments use the KITTI object detection benchmark dataset, which is divided into 7481 training and 7518 testing samples.
For experimental studies we split the official training set into 3712 training samples and 3769 validation samples. We follow the standard convention [2, 31] of only using lidar points that project into the image. Following the standard literature practice on KITTI [11, 31, 28], we train one network for cars and one network for both pedestrians and cyclists.

4.2. Settings

Voxel map settings

  • x-y (pillar) resolution: 0.16 m; max number of pillars (P): 12000; max number of points per pillar (N): 100

About anchors

  • We use the same anchors and matching strategy as VoxelNet. Each class anchor is described by a width, length, height, and z center, and is applied at two orientations: 0 and 90 degrees. Anchors are matched to ground truth using 2D IoU with the following rules: a positive match is either the highest IoU with a ground truth box or above the positive match threshold, while a negative match is below the negative threshold; all other anchors are ignored (neutral) in the loss (a minimal matching sketch follows this list).
  • Car. The x, y, z range is [(0, 70.4), (-40, 40), (-3, 1)] meters, respectively. The car anchor has width, length, and height of (1.6, 3.9, 1.5) m with a z center of -1 m. Matching uses positive and negative thresholds of 0.6 and 0.45.
  • Pedestrian & Cyclist. The x, y, z range is [(0, 48), (-20, 20), (-2.5, 0.5)] meters, respectively. The pedestrian anchor has width, length, and height of (0.6, 0.8, 1.73) meters with a z center of -0.6 meters, while the cyclist anchor has width, length, and height of (0.6, 1.76, 1.73) meters with a z center of -0.6 meters. Matching uses positive and negative thresholds of 0.5 and 0.35.
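As a rough illustration of the matching rule (not the authors' code), the sketch below labels each anchor positive, negative, or ignored from its best overlap with the ground truth. For simplicity it uses axis-aligned 2D IoU on (x1, y1, x2, y2) BEV boxes, whereas the real anchors also come in a 90-degree-rotated variant; the thresholds default to the car values (0.6 / 0.45).

```python
import numpy as np

def iou_2d(boxes_a, boxes_b):
    """Axis-aligned 2D IoU between (N, 4) and (M, 4) boxes given as (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match_anchors(anchors, gt_boxes, pos_thr=0.6, neg_thr=0.45):
    """Label anchors: 1 = positive, 0 = negative, -1 = ignored (neutral in the loss)."""
    if len(gt_boxes) == 0:
        return np.zeros(len(anchors), dtype=np.int64)   # no objects: all anchors are negative
    iou = iou_2d(anchors, gt_boxes)                     # (num_anchors, num_gt)
    labels = -np.ones(len(anchors), dtype=np.int64)     # start as ignored
    max_iou = iou.max(axis=1)
    labels[max_iou < neg_thr] = 0                       # clearly below the negative threshold
    labels[max_iou >= pos_thr] = 1                      # clearly above the positive threshold
    labels[iou.argmax(axis=0)] = 1                      # best anchor for each ground truth box
    return labels
```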

Post-processing setting

  • At inference time we apply axis-aligned non-maximum suppression (NMS) with an overlap threshold of 0.5 IoU.
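A minimal sketch of greedy axis-aligned NMS at the 0.5 IoU threshold, again on (x1, y1, x2, y2) BEV boxes; the function and variable names are illustrative.

```python
import numpy as np

def nms_bev(boxes, scores, iou_thr=0.5):
    """Greedy axis-aligned NMS: repeatedly keep the highest-scoring box and drop overlaps."""
    order = np.argsort(scores)[::-1]        # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU of the kept box against all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]        # keep only boxes that do not overlap too much
    return keep
```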

4.3. Data Augmentation

First, following SECOND, for each sample we randomly select 15, 0, and 8 ground truth samples for cars, pedestrians, and cyclists respectively and place them into the current point cloud (ground-truth database sampling). Next, all ground truth boxes are individually augmented (each box is rotated and translated).
Finally, we perform two sets of global augmentations that are jointly applied to the point cloud and all boxes: a random mirroring flip along the x axis, a global rotation and global scaling, and a global translation.
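A minimal sketch of the global augmentations applied jointly to the points and boxes. The flip is implemented as mirroring about the x axis (negating y and the heading), and the sampling ranges for rotation and scaling are assumed SECOND-style defaults rather than values given in these notes; the N(0, 0.2) translation noise follows the paper.

```python
import numpy as np

def global_augment(points, boxes, rng=np.random):
    """points: (N, 4) array of [x, y, z, r]; boxes: (M, 7) array of [x, y, z, w, l, h, theta]."""
    # Random mirroring flip about the x axis: negate y and the heading angle.
    if rng.rand() < 0.5:
        points[:, 1] *= -1
        boxes[:, 1] *= -1
        boxes[:, 6] *= -1
    # Global rotation about the z axis (range is an assumed SECOND-style default).
    angle = rng.uniform(-np.pi / 4, np.pi / 4)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    points[:, :2] = points[:, :2] @ rot.T
    boxes[:, :2] = boxes[:, :2] @ rot.T
    boxes[:, 6] += angle
    # Global scaling (range is an assumed SECOND-style default).
    scale = rng.uniform(0.95, 1.05)
    points[:, :3] *= scale
    boxes[:, :6] *= scale                      # scale centers and sizes together
    # Global translation drawn from N(0, 0.2) on x, y, z to simulate localization noise.
    shift = rng.normal(0.0, 0.2, size=3)
    points[:, :3] += shift
    boxes[:, :3] += shift
    return points, boxes
```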

5. Results

Quantitative Analysis

  • Official KITTI evaluation detection metrics are: bird’s eye view (BEV), 3D, 2D, and average orientation similarity (AOS). The 2D detection is done in the image plane. Average orientation similarity assesses the average orientation (measured in BEV) similarity for 2D detections.
  • The KITTI dataset is stratified into easy, moderate, and hard difficulties; methods are ranked by performance on the moderate setting.

Qualitative Analysis

  • ...

6. Realtime Inference

As indicated by our results (Table 1 and Figure 5), PointPillars represents a significant improvement in inference runtime. All runtimes are measured on a desktop with an Intel i7 CPU and a 1080 Ti GPU.
The main inference steps are as follows. First, the point cloud is loaded and filtered based on range and visibility in the images (1.4 ms). Then, the points are organized in pillars and decorated (2.7 ms). Next, the pillar tensor is uploaded to the GPU (2.9 ms), encoded (1.3 ms), scattered to the pseudo-image (0.1 ms), and processed by the backbone and detection heads (7.7 ms). Finally, NMS is applied on the CPU (0.1 ms), for a total runtime of 16.2 ms.

  • Encoding: the key design enabling this runtime is the PointPillars encoding. For example, at 1.3 ms it is 2 orders of magnitude faster than the VoxelNet encoder (190 ms).
  • Slimmer Design: We opt for a single PointNet in our encoder, compared to 2 sequential PointNets suggested by VoxelNet.
  • TensorRT: switching to TensorRT gave a 45.5% speedup over the PyTorch pipeline, which runs at 42.4 Hz.
  • The Need for Speed: while it could be argued that such runtime is excessive since a lidar typically operates at 20 Hz, the paper argues that on KITTI only the lidar points projecting into the front image are used, so a full 360-degree point cloud would contain far more points, and that deployed systems may run on less powerful embedded hardware than a desktop GPU.

7. Ablation Studies

7.1. Spatial Resolution

A trade-off between speed and accuracy can be achieved by varying the size of the spatial binning. Smaller pillars allow finer localization and lead to more features, while larger pillars are faster due to fewer non-empty pillars (speeding up the encoder) and a smaller pseudo-image (speeding up the CNN backbone). Note: binning here means how the x-y grid of pillars is set up.
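As a rough worked example using the settings from Section 4.2 (0.16 m pillars, car range x in (0, 70.4) m and y in (-40, 40) m), the canvas size scales as follows; the 0.32 m case is a hypothetical comparison, not a configuration reported in these notes.

```latex
\frac{70.4\ \text{m}}{0.16\ \text{m}} = 440, \qquad \frac{80\ \text{m}}{0.16\ \text{m}} = 500
\quad \Rightarrow \quad 440 \times 500 \ \text{canvas}

\text{At } 0.32\ \text{m pillars:} \quad 220 \times 250 \ \text{canvas, i.e. } 4\times \text{ fewer grid cells for the backbone}
```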

7.2. Per Box Data Augmentation

Both VoxelNet [31] and SECOND [28] recommend extensive per-box augmentation. However, in our experiments, minimal box augmentation worked better.
Our hypothesis is that the introduction of ground truth sampling mitigates the need for extensive per-box augmentation.

7.3. Point Decorations

During the lidar point decoration step, we perform the VoxelNet [31] decorations plus two additional decorations: x_p and y_p, the x and y offsets from the pillar x, y center. Note: decoration here means the extra information appended to each raw point.

7.4. Encoding

To assess the impact of the proposed PointPillars encoding in isolation, we implemented several encoders in the official codebase of SECOND [28]. Note: encoding here means converting the raw point cloud into a fixed-size format; the key contribution is encoding it into pillars.

Author: traviscui

Source: https://www.cnblogs.com/traviscui/p/16558376.html

License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
