[3D object detection] BEVFormer
paper: BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers, 2022
1. Grid-shaped BEV queries
We predefine a group of grid-shaped learnable parameters Q ∈ R^{H×W×C} as the queries of BEVFormer, where H, W are the spatial shape of the BEV plane. Specifically, the query Q_p ∈ R^{1×C} located at p = (x, y) of Q is responsible for the corresponding grid cell region in the BEV plane, and each grid cell corresponds to a real-world size of s meters. The center of the BEV features corresponds to the position of the ego car by default (e.g., on nuScenes the detection range can be set to [-40m, -40m, -1m, 40m, 40m, 5.4m], which is symmetric around the ego car). Following common practice [14], we add a learnable positional embedding to the BEV queries Q before inputting them to BEVFormer.
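A minimal PyTorch sketch of such grid-shaped queries is given below; the class name, default sizes, and the use of plain `nn.Parameter` are illustrative assumptions, not the official implementation:

```python
import torch
import torch.nn as nn

class BEVQueries(nn.Module):
    """Grid-shaped learnable BEV queries Q plus a learnable positional embedding.

    A minimal sketch; names and default sizes are assumptions for illustration.
    """
    def __init__(self, bev_h=200, bev_w=200, embed_dim=256):
        super().__init__()
        # Q ∈ R^{H×W×C}, stored flat as (H*W, C): one query per BEV grid cell.
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, embed_dim) * 0.02)
        # Learnable positional embedding added to Q before the encoder.
        self.bev_pos = nn.Parameter(torch.randn(bev_h * bev_w, embed_dim) * 0.02)

    def forward(self, batch_size: int) -> torch.Tensor:
        q = self.bev_queries + self.bev_pos               # (H*W, C)
        return q.unsqueeze(0).expand(batch_size, -1, -1)  # (B, H*W, C)
```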
2. Spatial cross-attention
Each BEV query only interacts with image features in its regions of interest.
Steps:
- First, lift each query on the BEV plane to a pillar-like query and sample \(N_{ref}\) 3D reference points from the pillar.
- Project these points to 2D views as reference points.
- Compute the real-world location (x′, y′) corresponding to the query \(Q_p\) located at p = (x, y) of Q, i.e., convert the grid coordinates into world coordinates centered at the ego vehicle.
- The objects located at (x′, y′) can appear at different heights z′ on the z-axis, so we predefine a set of anchor heights \(\{z^{'}_j\}^{N_{ref}}_{j=1}\) to make sure we capture clues that appear at different heights. In this way, for each query \(Q_p\), we obtain a pillar of 3D reference points \(\{(x^{'}, y^{'}, z^{'}_j)\}^{N_{ref}}_{j=1}\).
- Project the 3D reference points to the different image views through the camera projection matrices, where \(T_i \in \mathbb{R}^{3 \times 4}\) is the known projection matrix of the i-th camera.
- Sample the features from the hit views \(V_{hit}\) around these reference points.
- Perform a weighted sum of the sampled features as the output of spatial cross-attention (a toy sketch of these steps follows after this list).
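Below is a rough, self-contained sketch of the steps above (pillar reference points, projection with the camera matrices, feature sampling, and aggregation). Function and argument names such as `lidar2img`, the plain averaging over hit views, and the use of `grid_sample` are assumptions for illustration; BEVFormer itself realizes the weighted sum with a learned deformable-attention layer.

```python
import torch
import torch.nn.functional as F

def spatial_cross_attention_sketch(img_feats, lidar2img, pc_range,
                                   bev_h=200, bev_w=200, num_points_in_pillar=4):
    """Toy version of the sampling pipeline of spatial cross-attention.

    img_feats: (B, N_cam, C, H_img, W_img) multi-camera feature maps
    lidar2img: (B, N_cam, 4, 4) homogeneous projection matrices (T_i padded to 4x4)
    pc_range:  [x_min, y_min, z_min, x_max, y_max, z_max], e.g. the symmetric
               nuScenes range quoted above
    Returns:   (B, bev_h*bev_w, C) aggregated features, one vector per BEV query.
    """
    B, N_cam, C, H_img, W_img = img_feats.shape
    device = img_feats.device

    # 1) Lift every BEV grid cell to a pillar of N_ref 3D reference points:
    #    normalised (x, y) of the cell centre plus N_ref anchor heights z'_j.
    xs = (torch.arange(bev_w, device=device) + 0.5) / bev_w
    ys = (torch.arange(bev_h, device=device) + 0.5) / bev_h
    zs = (torch.arange(num_points_in_pillar, device=device) + 0.5) / num_points_in_pillar
    y, x = torch.meshgrid(ys, xs, indexing="ij")
    xy = torch.stack([x.flatten(), y.flatten()], dim=-1)              # (H*W, 2)
    xy = xy[None, :, None, :].expand(B, -1, num_points_in_pillar, -1)
    z = zs[None, None, :, None].expand(B, bev_h * bev_w, -1, 1)
    ref_3d = torch.cat([xy, z], dim=-1)                               # (B, H*W, N_ref, 3)

    # Normalised grid coords -> metric ego-centred coords (x', y', z'_j).
    lo = torch.tensor(pc_range[:3], device=device)
    hi = torch.tensor(pc_range[3:], device=device)
    ref_3d = ref_3d * (hi - lo) + lo

    # 2) Project the pillar points into every camera view with T_i.
    pts = torch.cat([ref_3d, torch.ones_like(ref_3d[..., :1])], dim=-1)
    pts = pts[:, None].expand(B, N_cam, -1, -1, -1)                   # (B, N_cam, H*W, N_ref, 4)
    cam = torch.einsum("bnij,bnqkj->bnqki", lidar2img, pts)
    depth = cam[..., 2:3]
    uv = cam[..., :2] / depth.clamp(min=1e-5)                         # pixel coords (u, v)

    # "Hit" views: the point lies in front of the camera and inside the image.
    hit = (depth[..., 0] > 1e-5) & (uv[..., 0] >= 0) & (uv[..., 0] < W_img) \
          & (uv[..., 1] >= 0) & (uv[..., 1] < H_img)                  # (B, N_cam, H*W, N_ref)

    # 3) Bilinearly sample image features around the projected reference points.
    grid = torch.stack([uv[..., 0] / W_img, uv[..., 1] / H_img], dim=-1) * 2 - 1
    feats = F.grid_sample(img_feats.flatten(0, 1), grid.flatten(0, 1),
                          align_corners=False)                        # (B*N_cam, C, H*W, N_ref)
    feats = feats.view(B, N_cam, C, bev_h * bev_w, num_points_in_pillar)

    # 4) Plain average over hit views and reference points; the real layer
    #    instead learns a weighted sum via attention.
    mask = hit.unsqueeze(2).float()
    return ((feats * mask).sum(dim=(1, 4))
            / mask.sum(dim=(1, 4)).clamp(min=1.0)).permute(0, 2, 1)
```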
3. Temporal self-attention
Each BEV query interacts with two features: the BEV queries at the current timestamp and the BEV features at the previous timestamp.
Steps:
- Given the BEV queries Q at the current timestamp t and the history BEV features \(B_{t-1}\) preserved at timestamp t−1, we first align \(B_{t-1}\) to Q according to ego-motion, so that the features at the same grid cell correspond to the same real-world location, i.e., the previous frame's BEV features are warped into the world coordinate frame of Q (see the alignment sketch below).
- Since it is challenging to construct a precise association of the same objects between the BEV features of different timestamps, the temporal connection between the features is modeled through the temporal self-attention (TSA) layer.
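Below is a minimal sketch of the ego-motion alignment step, implemented as an affine warp of the stored BEV feature map. The function name, arguments, default grid resolution, and the affine-grid approximation are assumptions for illustration (the official code performs its own rotation/translation of the BEV plane); after this alignment, the TSA layer lets each query attend to both Q and the aligned \(B^{'}_{t-1}\).

```python
import math
import torch
import torch.nn.functional as F

def align_prev_bev(prev_bev, delta_x, delta_y, delta_yaw,
                   bev_h=200, bev_w=200, grid_res=0.4):
    """Warp B_{t-1} into the current ego frame so that the same grid cell
    refers to the same real-world location as the current BEV queries Q.

    prev_bev:  (B, bev_h*bev_w, C) BEV features preserved at timestamp t-1
    delta_x/y: ego translation in metres between t-1 and t
    delta_yaw: ego rotation in radians between t-1 and t
    grid_res:  metres per BEV grid cell (the s in the notes)
    """
    B, _, C = prev_bev.shape
    bev = prev_bev.permute(0, 2, 1).view(B, C, bev_h, bev_w)

    # Affine transform mapping current-frame (normalised) grid coords back to
    # the previous frame: rotate by the yaw change, shift by the translation
    # expressed in normalised grid units ([-1, 1] spans the whole BEV plane).
    cos, sin = math.cos(delta_yaw), math.sin(delta_yaw)
    tx = 2.0 * delta_x / (bev_w * grid_res)
    ty = 2.0 * delta_y / (bev_h * grid_res)
    theta = torch.tensor([[cos, -sin, tx],
                          [sin,  cos, ty]], dtype=bev.dtype, device=bev.device)
    grid = F.affine_grid(theta.unsqueeze(0).expand(B, -1, -1),
                         size=list(bev.shape), align_corners=False)
    aligned = F.grid_sample(bev, grid, align_corners=False)   # aligned B'_{t-1}
    return aligned.flatten(2).permute(0, 2, 1)                 # (B, H*W, C)
```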