DI-Fusion: Online Implicit 3D Reconstruction with Deep Priors (Reading Notes)

Introduction

  • Previous Work

    • bundle adjustment or loop closure
    • VoxHashing with Signed Distance Function
    • deep geometry learning
  • Challenges

    • uncertainty from sensor noise or view occlusion needs to be explicitly modeled
    • an accurate camera tracking formulation
    • an efficient surface mapping strategy
  • DI-Fusion

    • extend the original local implicit grids
    • adapt them into the PLIVox (probabilistic local implicit voxel) representation
    • an additional uncertainty encoding
    • approximate gradient for solving the camera tracking problem efficiently
    • encoder-decoder network design

Datasets: 3D RGB-D benchmarks (ICL-NUIM, ScanNet)

Related Works

  • Online 3D Reconstruction and SLAM.

  • Learned Probabilistic Reconstruction.

  • Implicit Representation.

Method

Given a sequential RGB-D stream, DI-Fusion incrementally builds up a 3D scene based on a novel PLIVox representation, which is implicitly parameterized by a neural network and effectively encodes useful local scene priors.

(Figure: DI-Fusion system pipeline overview)

  • We represent the reconstructed 3D scene with PLIVoxs (Sec. 3.1).

  • Given input RGB-D frames, we first estimate the camera pose \(\mathbf T^t\) by finding the best alignment between the current depth point cloud and the map (Sec. 3.2), then the depth observations are integrated (Sec. 3.3) for surface mapping.

  • The scene mesh can be extracted on demand at any time and at any resolution.

Note that both the camera tracking and surface mapping are performed directly on the deep implicit representation.

PLIVox Representation

The reconstructed scene is sparsely partitioned into evenly-spaced voxels (PLIVoxs).

\[\begin{aligned} \mathcal V &= \{v_m = (\mathbf c_m, \mathbf l_m, w_m)\}\\ \mathbf c_m &\in \mathbb R^3 \text{ (voxel centroid)}\\ \mathbf l_m &\in \mathbb R^L \text{ (the latent vector encoding the scene priors)}\\ w_m &\in \mathbb N \text{ (observation weight)} \end{aligned} \]

For a query point \(\mathbf x \in \mathbb R^3\), its corresponding PLIVox index is given by \(m(\mathbf x): \mathbb R^3 \mapsto \mathbb N^+\).

The local coordinate of \(\mathbf x\) in \(v_{m(\mathbf x)}\) is calculated as \(\mathbf y = \frac 1a(\mathbf x - \mathbf c_{m(\mathbf x)}) \in [-\frac 12, \frac 12]^3\), with \(a\) being the voxel size.
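
As a minimal sketch of this indexing, assuming a sparse map keyed by integer voxel coordinates (the helper names and the voxel size value below are illustrative, not from the paper):

```python
import numpy as np

VOXEL_SIZE = 0.1  # the voxel size a (meters); illustrative value

def voxel_key(x: np.ndarray) -> tuple:
    """Integer key identifying the PLIVox m(x) that contains world point x."""
    return tuple(np.floor(x / VOXEL_SIZE).astype(int))

def voxel_centroid(key: tuple) -> np.ndarray:
    """Centroid c_m of the PLIVox with the given integer key."""
    return (np.asarray(key, dtype=np.float64) + 0.5) * VOXEL_SIZE

def local_coordinate(x: np.ndarray) -> np.ndarray:
    """y = (x - c_m) / a, which lies in [-1/2, 1/2]^3 by construction."""
    return (x - voxel_centroid(voxel_key(x))) / VOXEL_SIZE
```

Sparse storage then amounts to a dictionary from `voxel_key(x)` to the tuple \((\mathbf c_m, \mathbf l_m, w_m)\).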

Probabilistic Signed Distance Function.

The output at every position \(\mathbf y\) is not a single SDF value but an SDF distribution \(s \sim p(\cdot \mid \mathbf y)\).

Here we model the SDF distribution as a Gaussian distribution \(\mathcal N(\mu, \sigma^2)\).

We encode the PSDF with a latent vector \(\mathbf l_m\) using an encoder-decoder deep neural network \(\Phi\)

Encoder-Decoder Neural Network.

\(\Phi = \{\phi_E \text{ (encoding sub-network)}, \phi_D\text{ (decoding sub-network)}\}\)

(Figure: the encoder-decoder network \(\Phi = \{\phi_E, \phi_D\}\))

  • \(\phi_E\) converts the measurements from each depth point observation at frame \(t\) to an observation latent vector \(\mathbf l_m^t\).

    Each point measurement's local coordinate \(\mathbf y\) and normal direction \(\mathbf n\) are transformed to an \(L\)-dimensional feature vector \(\phi_E(\mathbf y, \mathbf n)\) using only FC (Fully Connected) layers.

    Then the feature vectors from multiple points are aggregated into one latent vector \(\mathbf l^t_m\) using a mean-pooling layer.

  • \(\phi_D\) takes the concatenation of the local coordinate \(\mathbf y\) and the latent vector \(\mathbf l_m\) as input, and outputs a 2-tuple \(\{\mu_D,\sigma_D\}\) representing the Gaussian parameters described before.

    Note that the latent vectors \(\mathbf l^t_m\) (produced by \(\phi_E\)) and \(\mathbf l_m\) (consumed by \(\phi_D\)) are different latent vectors. A code sketch of this encoder-decoder is given after the list.
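
A minimal PyTorch-style sketch of such an encoder-decoder, assuming small MLPs; the layer widths, activation choices, and the latent size `L` are placeholders rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

L = 16  # latent dimension; placeholder value

class Encoder(nn.Module):
    """phi_E: maps each (y, n) pair to an L-dim feature with FC layers only;
    features of all points in a PLIVox are mean-pooled into l_m^t."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, L))

    def forward(self, y: torch.Tensor, n: torch.Tensor) -> torch.Tensor:
        feats = self.mlp(torch.cat([y, n], dim=-1))   # (N, L)
        return feats.mean(dim=0)                      # mean pooling -> (L,)

class Decoder(nn.Module):
    """phi_D: maps the concatenation [y, l_m] to the Gaussian parameters (mu, sigma)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + L, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 2))

    def forward(self, y: torch.Tensor, l_m: torch.Tensor) -> tuple:
        h = self.mlp(torch.cat([y, l_m.expand(y.shape[0], -1)], dim=-1))
        mu = h[:, 0]
        sigma = F.softplus(h[:, 1]) + 1e-6            # keep sigma strictly positive
        return mu, sigma
```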

Network Training.

Train \(\phi_E\) and \(\phi_D\) jointly in an end-to-end fashion, setting \(\mathbf l^t_m \equiv \mathbf l_m\).

two datasets:

  • \(\mathcal S = \{\mathcal S_m\}\) for the encoder, which is a set of tuples \(\mathcal S_m = \{(\mathbf y_i, \mathbf n_i)\}\) for each PLIVox \(v_m\), with points \(\mathbf y_i\) and normals \(\mathbf n_i\) sampled from the scene surface.

  • \(\mathcal D = \{\mathcal D_m\}\) for the decoder, which consists of tuples \(\mathcal D_m = \{(\mathbf y_i, s^{i}_{\mathrm{gt}})\}\), where the points \(\mathbf y_i\) are randomly sampled within a PLIVox using a sampling strategy similar to prior work, and \(s^i_{\mathrm{gt}}\) is the ground-truth SDF at point \(\mathbf y_i\).

The goal of training is to maximize the likelihood of the dataset \(\mathcal D\) over all training PLIVoxs.

Specifically, the loss function \(\mathcal L_m\) for each PLIVox \(v_m\) is written as

\[\begin{aligned} \mathcal L_m &= - \sum_{(\mathbf y_i, s^i_{\mathrm{gt}}) \in \mathcal D_m} \log \mathcal N(s^{i}_{\mathrm{gt}}; \mu_D(\mathbf y_i, \mathbf l_m), \sigma_D^2(\mathbf y_i, \mathbf l_m))\\ \mathbf l_m &= \frac 1{|\mathcal S_m|} \sum_{(\mathbf y_i, \mathbf n_i) \in \mathcal S_m} \phi_E (\mathbf y_i, \mathbf n_i) \end{aligned} \]

We regularize the norm of the latent vector with an \(\ell_2\) loss, which reflects the prior distribution of \(\mathbf l_m\).

The final loss function \(\mathcal L\) is:

\[\mathcal L = \sum_{v_m \in \mathcal V} \left( \mathcal L_m + \delta \|\mathbf l_m\|^2 \right) \]
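
Reusing the encoder/decoder sketch above, the training objective might look like the following. This is a sketch only: the weight `delta` and the per-PLIVox batching are simplified, and the constant term of the Gaussian log-density is dropped.

```python
import torch

def plivox_loss(encoder, decoder, surf_y, surf_n, query_y, sdf_gt, delta=1e-2):
    """L_m + delta * ||l_m||^2 for one training PLIVox.
    (surf_y, surf_n) come from S_m; (query_y, sdf_gt) come from D_m."""
    l_m = encoder(surf_y, surf_n)              # mean-pooled latent vector l_m
    mu, sigma = decoder(query_y, l_m)          # Gaussian PSDF parameters per query
    # Negative log-likelihood of N(s_gt; mu, sigma^2), up to an additive constant.
    nll = torch.log(sigma) + 0.5 * ((sdf_gt - mu) / sigma) ** 2
    return nll.sum() + delta * l_m.pow(2).sum()
```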

Camera Tracking

A frame-to-model camera tracking method: the learned deep priors carry enough information about the 3D scene for accurate camera pose estimation.

We formulate the PSDF as an objective function for camera pose estimation, with an approximate gradient of the objective with respect to the camera pose.

  • Tracking

    We denote the RGB-D observation at frame \(t\) as \(\mathcal O^t = \{\mathcal I^t \text{ (intensity)}, \mathcal D^t \text{ (depth)}\}\)

    The depth measurement \(\mathcal D^t\) can be re-projected to 3D as point measurements \(\mathcal P^t = \pi'(\mathcal D^t)\) where \(\pi\) is the projection function and \(\pi'\) is its inverse.

    The goal is to estimate \(\mathcal O^t\)'s camera pose \(\mathbf T^t \in SE(3)\) by optimizing the relative pose \(T(\xi^t) = \exp((\xi^t)^\wedge)\), \(\xi^t \in \mathfrak{se}(3)\), between \(\mathcal O^t\) and \(\mathcal O^{t - 1}\), i.e. \(\mathbf T^t = \mathbf T^{t - 1} T(\xi^t)\).

    The following objective function is minimized in our system

    \[E(\xi^t) = E_{\mathrm{sdf}}(\xi^t) + w E_{\mathrm{int}}(\xi^t) \]

    where \(E_{\mathrm{sdf}}(\xi^t)\) and \(E_{\mathrm{int}}(\xi^t)\) are the SDF term and intensity term respectively, and \(w\) is a weight parameter.

  • SDF Term \(E_{\mathrm{sdf}}(\xi^t)\)

    Perform frame-to-model alignment of the point measurements \(\mathcal P^t\) to the on-surface geometry decoded by \(\mathcal V\).

    We choose to minimize the signed distance value of each point in \(\mathcal P^t\) when transformed by the optimized camera pose.

    The objective function is:

    \[\begin{aligned} E_{\mathrm{sdf}}(\xi^t) &= \sum_{\mathbf p^t \in \mathcal P^t} \rho(r(G(\xi^t, \mathbf p^t)))\\ G(\xi^t, \mathbf p^t) &= \mathbf T^{t-1} T(\xi^t) \mathbf p^t\\ r(x) &= \frac{\mu_D(x, \mathbf l_{m(x)})}{\sigma_D(x, \mathbf l_{m(x)})} \end{aligned} \]

    One important step in optimizing the SDF term is the computation of \(r(\cdot)\)'s gradient with respect to \(\xi^t\), i.e. \(\frac{\partial r}{\partial \xi^t}\).

    We treat \(\sigma_D\) as constant during the local linearization:

    \[\frac{\partial r}{\partial \xi^t} = \frac{1}{\sigma_D} \frac{\partial \mu_D(\cdot, \mathbf l_{m(x)})}{\partial x} (\mathbf R^{t-1})^{\top} (T(\xi^t) \mathbf p^t)^{\bigodot} \]

    where \(\mathbf R^{t - 1}\) is the rotation part of \(\mathbf T^{t - 1}\) and \({\mathbf p}^{\bigodot} := [\mathbf I_3, -\mathbf p^{\wedge}]\). A sketch of the tracking residuals is given after this list.

  • Intensity Term \(E_{\mathrm{int}}(\xi^t)\)

    It is defined as

    \[E_{\mathrm{int}}(\xi^t) = \sum_{\mathbf u \in \Omega} (\mathcal I^t[\mathbf u] - \mathcal I^{t - 1}[\pi(T(\xi^t) \pi'(\mathbf u, \mathcal D^t[\mathbf u]))])^2 \]

    where \(\Omega\) is the image domain.

    This intensity term takes effect when the SDF term fails in areas with fewer geometric details, such as walls or floors.
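
A schematic sketch of the SDF energy (the intensity term is assembled analogously from warped photometric differences). Here `decode_psdf` is a hypothetical helper that locates the PLIVox of each point and evaluates \(\phi_D\), and the robust kernel \(\rho\) is taken to be a Huber loss as an assumption:

```python
import torch
import torch.nn.functional as F

def se3_transform(T: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    """Apply a 4x4 rigid transform to an (N, 3) point set."""
    return pts @ T[:3, :3].T + T[:3, 3]

def sdf_energy(points_t, T_prev, T_xi, decode_psdf, huber_delta=0.1):
    """E_sdf(xi) = sum_p rho(r(G(xi, p))) with r(x) = mu_D(x) / sigma_D(x).
    decode_psdf(x) -> (mu, sigma) is an assumed helper, not the paper's API."""
    x = se3_transform(T_prev @ T_xi, points_t)        # G(xi, p) = T^{t-1} T(xi) p
    mu, sigma = decode_psdf(x)
    r = mu / sigma
    # Huber kernel rho applied to the uncertainty-weighted SDF residuals.
    return F.huber_loss(r, torch.zeros_like(r), delta=huber_delta, reduction="sum")

def tracking_energy(E_sdf: torch.Tensor, E_int: torch.Tensor, w: float) -> torch.Tensor:
    """Total objective E(xi) = E_sdf(xi) + w * E_int(xi)."""
    return E_sdf + w * E_int
```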

Surface Mapping

After the camera pose of the RGB-D observation \(\mathcal O^t\) is estimated, we update the map from \(\mathcal O^t\) based on the deep implicit representation by fusing the new, noisy scene geometry into the existing PLIVoxs; this step is also referred to as geometry integration.

  • Geometry Integration.

    We perform the geometry integration by updating the geometry latent vector \(\mathbf l_m\) with the observation latent vector \(\mathbf l^t_m\) encoded by the point measurements \(\mathcal P^t\).

    We transform \(\mathcal P^t\) according to \(\mathbf T^t\) and then estimate the normal of each point measurement, obtaining \(\mathcal X^t = \{(\mathbf x_i, \mathbf n_i)\}\)

    In each PLIVox \(v_m\), we collect the point measurements \(\mathcal Y_m^t \subset \mathcal X^t\) falling inside it and compute the observation latent vector as \(\mathbf l_m^t = \frac{1}{w_m^t} \sum_{(\mathbf y, \mathbf n) \in \mathcal Y_m^t} \phi_E (\mathbf y, \mathbf n)\).

    \(\mathbf l_m\) is then updated as:

    \[\mathbf l_m \leftarrow \frac{\mathbf l_m w_m + \mathbf l_m^t w_m^t}{w_m + w_m^t}, \quad w_m \leftarrow w_m + w_m^t \]

    where the weight \(w^t_m\) is set to the number of points within the PLIVox, i.e. \(|\mathcal Y^t_m|\) (a sketch of this update is given after the list).

  • Mesh Extraction

    Divide each PLIVox into equally-spaced volumetric grids and query the SDF for each grid with the decoder \(\phi_D\) using the PLIVox's latent vector.

    Double each PLIVox’s domain such that the volumetric grids between neighboring PLIVoxs overlap with each other.

    The final SDF of each volumetric grid is trilinearly interpolated with the SDFs decoded from the overlapping PLIVoxs.
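
A minimal sketch of the geometry integration step for a single PLIVox, assuming NumPy and a simple dataclass whose fields mirror \((\mathbf c_m, \mathbf l_m, w_m)\); `point_feats` stands for the per-point outputs of \(\phi_E\) for the points \(\mathcal Y_m^t\) falling in this voxel:

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class PLIVox:
    centroid: np.ndarray                 # c_m
    latent: Optional[np.ndarray] = None  # l_m, None until the first observation
    weight: int = 0                      # w_m

def integrate(voxel: PLIVox, point_feats: np.ndarray) -> None:
    """Fuse the frame-t observation latent l_m^t into l_m by a weighted running average."""
    w_t = point_feats.shape[0]           # w_m^t = |Y_m^t|
    if w_t == 0:
        return
    l_t = point_feats.mean(axis=0)       # l_m^t = (1 / w_m^t) * sum phi_E(y, n)
    if voxel.weight == 0:                # first observation of this voxel
        voxel.latent, voxel.weight = l_t, w_t
        return
    voxel.latent = (voxel.latent * voxel.weight + l_t * w_t) / (voxel.weight + w_t)
    voxel.weight += w_t
```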

posted @ 2022-03-18 12:32  zjp_shadow