DI-Fusion: Online Implicit 3D Reconstruction with Deep Priors (Reading Notes)

Introduction

  • Previous Work

    • bundle adjustment or loop closure
    • VoxHashing with Signed Distance Function
    • deep geometry learning
  • Challenges

    • explicit modeling of sensor noise and view occlusion
    • an accurate camera tracking formulation
    • an efficient surface mapping strategy
  • DI-Fusion

    • extend the original local implicit grids
    • adapt it into PLIVox
    • an additional uncertainty encoding
    • approximate gradient for solving the camera tracking problem efficiently
    • encoder-decoder network design

Datasets: 3D RGB-D benchmarks (ICL-NUIM, ScanNet)

Related Works

  • Online 3D Reconstruction and SLAM.

  • Learned Probabilistic Reconstruction.

  • Implicit Representation.

Method

Given a sequential RGB-D stream, DI-Fusion incrementally builds up a 3D scene based on a novel PLIVox representation, which is implicitly parameterized by a neural network and effectively encodes useful local scene priors.

[Figure: DI-Fusion system overview]

  • We represent the reconstructed 3D scene with PLIVoxs (Sec. 3.1).

  • Given input RGB-D frames, we first estimate the camera pose $T_t$ by finding the best alignment between the current depth point cloud and the map (Sec. 3.2), then the depth observations are integrated (Sec. 3.3) for surface mapping.

  • The scene mesh can be extracted on demand at any time and at any resolution.

Note that both the camera tracking and surface mapping are performed directly on the deep implicit representation.
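
A minimal sketch of this per-frame loop in Python; `track_camera`, `integrate_depth`, and `extract_mesh` are hypothetical helper names standing in for Sec. 3.2 and Sec. 3.3, not the authors' API:

```python
# Hypothetical top-level loop of an online system like DI-Fusion.
# track_camera / integrate_depth / extract_mesh are placeholder names,
# not the authors' API; they correspond to Sec. 3.2 and Sec. 3.3.
def run(rgbd_stream, plivox_map, T_init):
    T_prev = T_init                                      # pose of frame t-1
    for frame in rgbd_stream:                            # frame = (intensity, depth)
        T_cur = track_camera(frame, plivox_map, T_prev)  # camera tracking (Sec. 3.2)
        integrate_depth(frame, T_cur, plivox_map)        # surface mapping (Sec. 3.3)
        T_prev = T_cur
    return extract_mesh(plivox_map)                      # on demand, any resolution
```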

PLIVox Representation

The reconstructed scene is sparsely partitioned into evenly-spaced voxels (PLIVoxs).

$$\mathcal{V} = \{\, v_m = (\mathbf{c}_m, \mathbf{l}_m, w_m) \,\}, \quad \mathbf{c}_m \in \mathbb{R}^3 \ \text{(voxel centroid)}, \quad \mathbf{l}_m \in \mathbb{R}^L \ \text{(latent vector encoding the scene priors)}, \quad w_m \in \mathbb{N} \ \text{(observation weight)}$$

For a query point $\mathbf{x} \in \mathbb{R}^3$, its corresponding PLIVox is found via the index mapping $m(\mathbf{x}): \mathbb{R}^3 \mapsto \mathbb{N}^+$.

The local coordinate of $\mathbf{x}$ in $v_{m(\mathbf{x})}$ is calculated as $\mathbf{y} = \frac{1}{a}\left(\mathbf{x} - \mathbf{c}_{m(\mathbf{x})}\right) \in \left[-\frac{1}{2}, \frac{1}{2}\right]^3$ ($a$ being the voxel size).
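
A small Python sketch of this bookkeeping, assuming a dictionary-keyed sparse map; the voxel size and latent dimension are assumed values, not taken from the paper:

```python
import numpy as np

VOXEL_SIZE = 0.1   # a, in meters (assumed value)
LATENT_DIM = 29    # L (assumed value)

class PLIVox:
    """One voxel v_m = (c_m, l_m, w_m)."""
    def __init__(self, centroid):
        self.centroid = centroid                 # c_m in R^3
        self.latent = np.zeros(LATENT_DIM)       # l_m in R^L
        self.weight = 0                          # w_m in N

def voxel_index(x, a=VOXEL_SIZE):
    """m(x): R^3 -> integer grid key of the PLIVox containing x."""
    return tuple(np.floor(x / a).astype(int))

def local_coordinate(x, centroid, a=VOXEL_SIZE):
    """y = (x - c_m) / a, lies in [-1/2, 1/2]^3."""
    return (x - centroid) / a

# Sparse map: only voxels that received observations are allocated.
voxels = {}
x = np.array([0.42, 1.07, -0.33])
idx = voxel_index(x)
if idx not in voxels:
    centroid = (np.array(idx) + 0.5) * VOXEL_SIZE   # voxel center
    voxels[idx] = PLIVox(centroid)
y = local_coordinate(x, voxels[idx].centroid)       # e.g. [-0.3, 0.2, 0.2]
```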

Probabilistic Signed Distance Function.

The output at every position $\mathbf{y}$ is not a single SDF value but an SDF distribution $s \sim p(\cdot \mid \mathbf{y})$.

Here we model the SDF distribution as a canonical Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$.

We encode the PSDF with a latent vector $\mathbf{l}_m$ using an encoder-decoder deep neural network $\Phi$.

Encoder-Decoder Neural Network.

$$\Phi = \{\, \phi_E \ \text{(encoding sub-network)},\ \phi_D \ \text{(decoding sub-network)} \,\}$$

[Figure: encoder-decoder network architecture]

  • $\phi_E$ converts the measurements from each depth point observation at frame $t$ into observation latent vectors $\mathbf{l}_m^t$.

    A point measurement's local coordinate $\mathbf{y}$ and normal direction $\mathbf{n}$ are transformed into an $L$-dimensional feature vector $\phi_E(\mathbf{y}, \mathbf{n})$ using only FC (Fully Connected) layers.

    Then the feature vectors from multiple points are aggregated into one latent vector $\mathbf{l}_m^t$ using a mean-pooling layer.

  • $\phi_D$ takes the concatenation of the local coordinate $\mathbf{y}$ and the latent vector $\mathbf{l}_m$ as input and outputs a 2-tuple $\{\mu_D, \sigma_D\}$, representing the Gaussian parameters described above.

    Note that the latent vectors $\mathbf{l}_m^t$ (used with $\phi_E$) and $\mathbf{l}_m$ (used with $\phi_D$) are different latent vectors (a PyTorch sketch of both sub-networks is given below).
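
A minimal PyTorch sketch of such an encoder-decoder; the layer widths, depths, and the softplus used to keep $\sigma_D$ positive are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

LATENT_DIM = 29  # L (assumed value)

class Encoder(nn.Module):
    """phi_E: per-point (y, n) in R^6 -> feature in R^L, then mean-pooled to l_m^t."""
    def __init__(self, latent_dim=LATENT_DIM, hidden=64):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, y, n):                        # y, n: (N, 3)
        feats = self.fc(torch.cat([y, n], dim=-1))  # (N, L) per-point features
        return feats.mean(dim=0)                    # mean pooling -> l_m^t in R^L

class Decoder(nn.Module):
    """phi_D: (y, l_m) -> (mu_D, sigma_D), the Gaussian PSDF parameters."""
    def __init__(self, latent_dim=LATENT_DIM, hidden=128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, y, latent):                  # y: (N, 3), latent: (L,)
        latent = latent.expand(y.shape[0], -1)     # share l_m across query points
        out = self.fc(torch.cat([y, latent], dim=-1))
        mu = out[:, 0]
        sigma = nn.functional.softplus(out[:, 1]) + 1e-4  # keep sigma positive
        return mu, sigma
```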

Network Training.

We train $\phi_E$ and $\phi_D$ jointly in an end-to-end fashion, setting $\mathbf{l}_m^t \equiv \mathbf{l}_m$.

Two datasets are used:

  • $\mathcal{S} = \{S_m\}$ for the encoder: a set of tuples $S_m = \{(\mathbf{y}_i, \mathbf{n}_i)\}$ for each PLIVox $v_m$, with the points $\mathbf{y}_i$ and normals $\mathbf{n}_i$ sampled from the scene surface.

  • $\mathcal{D} = \{D_m\}$ for the decoder: a set of tuples $D_m = \{(\mathbf{y}_i, s_i^{gt})\}$, where the points $\mathbf{y}_i$ are randomly sampled within a PLIVox (using a sampling strategy similar to the one cited in the paper) and $s_i^{gt}$ is the ground-truth SDF at $\mathbf{y}_i$.

The goal of training is to maximize the likelihood of the dataset $\mathcal{D}$ over all training PLIVoxs.

Specifically, the loss function $L_m$ for each PLIVox $v_m$ is written as

$$L_m = \sum_{(\mathbf{y}_i,\, s_i^{gt}) \in D_m} -\log \mathcal{N}\!\left(s_i^{gt};\ \mu_D(\mathbf{y}_i, \mathbf{l}_m),\ \sigma_D^2(\mathbf{y}_i, \mathbf{l}_m)\right), \qquad \mathbf{l}_m = \frac{1}{|S_m|} \sum_{(\mathbf{y}_i, \mathbf{n}_i) \in S_m} \phi_E(\mathbf{y}_i, \mathbf{n}_i)$$

We regularize the norm of the latent vector with an $\ell_2$-loss, which reflects the prior distribution of $\mathbf{l}_m$.

The final loss function $L$ is:

$$L = \sum_{v_m \in \mathcal{V}} \left( L_m + \delta\, \lVert \mathbf{l}_m \rVert_2 \right)$$
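
A sketch of the per-PLIVox loss under these definitions, assuming the `Encoder`/`Decoder` modules from the sketch above and an arbitrary value for $\delta$ (the constant term of the Gaussian log-density is dropped):

```python
import torch

def plivox_loss(encoder, decoder, S_m, D_m, delta=1e-2):
    """L_m plus the l2 regularizer for one training PLIVox.
    S_m = (y_s, n_s): surface samples; D_m = (y_d, s_gt): SDF samples."""
    y_s, n_s = S_m                      # (Ns, 3) each
    y_d, s_gt = D_m                     # (Nd, 3) and (Nd,)

    latent = encoder(y_s, n_s)          # l_m = mean of phi_E over S_m
    mu, sigma = decoder(y_d, latent)    # Gaussian parameters at each y_i

    # Negative log-likelihood of s^gt under N(mu, sigma^2) (constant term dropped).
    nll = (0.5 * ((s_gt - mu) / sigma) ** 2 + torch.log(sigma)).sum()

    reg = delta * latent.norm()         # l2 prior on the latent vector
    return nll + reg
```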

Camera Tracking

DI-Fusion uses a frame-to-model camera tracking method: the learned deep priors carry enough information about the 3D scene for accurate camera pose estimation.

The PSDF is formulated into an objective function for camera pose estimation, with an approximate gradient of the objective with respect to the camera pose.

  • Tracking

    We denote the RGB-D observation at frame $t$ as $O^t = \{ I^t \ \text{(intensity)},\ D^t \ \text{(depth)} \}$.

    The depth measurement $D^t$ can be re-projected to 3D as point measurements $P^t = \pi^{-1}(D^t)$, where $\pi$ is the projection function and $\pi^{-1}$ is its inverse.

    The goal is to estimate $O^t$'s camera pose $T_t \in SE(3)$ by optimizing the relative pose $T(\xi_t) = \exp(\xi_t^{\wedge})$ ($\xi_t \in \mathfrak{se}(3)$) between $O^t$ and $O^{t-1}$, i.e. $T_t = T_{t-1}\, T(\xi_t)$.

    The following objective function is minimized in our system

    $$E(\xi_t) = E_{sdf}(\xi_t) + w\, E_{int}(\xi_t)$$

    where $E_{sdf}(\xi_t)$ and $E_{int}(\xi_t)$ are the SDF term and the intensity term respectively, and $w$ is a weight parameter.

  • SDF Term $E_{sdf}(\xi_t)$

    Perform frame-to-model alignment of the point measurements $P^t$ to the on-surface geometry decoded from $\mathcal{V}$.

    We choose to minimize the signed distance value of each point in $P^t$ when transformed by the optimized camera pose.

    The objective function is:

    $$E_{sdf}(\xi_t) = \sum_{\mathbf{p}_t \in P^t} \rho\big( r(G(\xi_t, \mathbf{p}_t)) \big), \qquad G(\xi_t, \mathbf{p}_t) = T_{t-1}\, T(\xi_t)\, \mathbf{p}_t, \qquad r(\mathbf{x}) = \frac{\mu_D(\mathbf{x}, \mathbf{l}_{m(\mathbf{x})})}{\sigma_D(\mathbf{x}, \mathbf{l}_{m(\mathbf{x})})}$$

    where $\rho(\cdot)$ is a robust kernel.

    One important step in optimizing the SDF term is the computation of $r(\cdot)$'s gradient with respect to $\xi_t$, i.e. $\frac{\partial r}{\partial \xi_t}$.

    Treating $\sigma_D$ as constant during the local linearization,

    $$\frac{\partial r}{\partial \xi_t} \approx \frac{1}{\sigma_D}\, \frac{\partial \mu_D(\cdot, \mathbf{l}_{m(\mathbf{x})})}{\partial \mathbf{x}}\, R_{t-1}\, \big(T(\xi_t)\, \mathbf{p}_t\big)^{\odot}$$

    where $R_{t-1}$ is the rotation part of $T_{t-1}$ and $\mathbf{p}^{\odot} := [\mathbf{I}_3,\ -\mathbf{p}^{\wedge}]$ (a numerical sketch of this Gauss-Newton step is given after this list).

  • Intensity Term $E_{int}(\xi_t)$

    It is defined as

    $$E_{int}(\xi_t) = \sum_{\mathbf{u} \in \Omega} \left( I^t[\mathbf{u}] - I^{t-1}\!\left[ \pi\!\left( T(\xi_t)\, \pi^{-1}(\mathbf{u}, D^t[\mathbf{u}]) \right) \right] \right)^2$$

    where Ω is the image domain.

    This intensity term takes effect where the SDF term fails in areas with few geometric details, such as walls or floors.
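
A numerical sketch of one Gauss-Newton step on the SDF term as described above. The PSDF query `query_psdf` (returning $\mu_D$, $\sigma_D$, and $\partial\mu_D/\partial\mathbf{x}$ at a world point) is a hypothetical stand-in for the decoder, $\rho$ is realized as a Huber re-weighting, and $\xi_t$ is ordered as [translation, rotation] to match $\mathbf{p}^{\odot} = [\mathbf{I}_3, -\mathbf{p}^{\wedge}]$:

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix v^ such that (v^) w = v x w."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def sdf_gauss_newton_step(points, T_prev, T_xi, query_psdf, huber_delta=0.1):
    """One Gauss-Newton step on E_sdf; points are the frame's 3D measurements p_t."""
    R_prev = T_prev[:3, :3]
    H = np.zeros((6, 6))                             # approximate Hessian J^T W J
    b = np.zeros(6)                                  # gradient J^T W r
    for p in points:
        q = T_xi[:3, :3] @ p + T_xi[:3, 3]           # T(xi_t) p_t
        x = R_prev @ q + T_prev[:3, 3]               # G(xi_t, p_t), in world frame
        mu, sigma, dmu_dx = query_psdf(x)            # decoder outputs at x
        r = mu / sigma                               # uncertainty-normalized SDF residual
        # dr/dxi ~= (1/sigma) dmu/dx * R_{t-1} * [I_3, -q^]   (sigma held constant)
        J = (dmu_dx / sigma) @ R_prev @ np.hstack([np.eye(3), -hat(q)])
        w = 1.0 if abs(r) < huber_delta else huber_delta / abs(r)  # Huber re-weighting
        H += w * np.outer(J, J)
        b += w * J * r
    return -np.linalg.solve(H + 1e-6 * np.eye(6), b)  # update delta xi_t (6-vector)
```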

Surface Mapping

After the camera pose of the RGB-D observation $O^t$ is estimated, we update the map from $O^t$ based on the deep implicit representation by fusing the new (noisy) scene geometry into the existing map; this step is also referred to as geometry integration.

  • Geometry Integration.

    We perform the geometry integration by updating the geometry latent vector $\mathbf{l}_m$ with the observation latent vector $\mathbf{l}_m^t$ encoded from the point measurements $P^t$.

    We transform $P^t$ according to $T_t$ and then estimate the normal of each point measurement, obtaining $X^t = \{(\mathbf{x}_i, \mathbf{n}_i)\}$.

    In each PLIVox $v_m$, the point measurements $Y_m^t \subset X^t$ are gathered and the observation latent vector is computed as $\mathbf{l}_m^t = \frac{1}{w_m^t} \sum_{(\mathbf{y}, \mathbf{n}) \in Y_m^t} \phi_E(\mathbf{y}, \mathbf{n})$.

    $\mathbf{l}_m$ is then updated as:

    $$\mathbf{l}_m \leftarrow \frac{\mathbf{l}_m\, w_m + \mathbf{l}_m^t\, w_m^t}{w_m + w_m^t}, \qquad w_m \leftarrow w_m + w_m^t$$

    where the weight $w_m^t$ is set to the number of points within the PLIVox, i.e. $w_m^t = |Y_m^t|$ (a sketch of this update is given after this list).

  • Mesh Extraction

    Divide each PLIVox into equally-spaced volumetric grid cells and query the SDF of each cell with the decoder $\phi_D$ using the PLIVox's latent vector.

    Double each PLIVox's domain such that the volumetric grids of neighboring PLIVoxs overlap with each other.

    The final SDF of each volumetric grid is trilinearly interpolated with the SDFs decoded from the overlapping PLIVoxs.
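
A sketch of the weighted latent-vector fusion described under Geometry Integration, reusing the hypothetical `PLIVox` container and `Encoder` from the earlier sketches (the encoder's mean pooling already realizes the $\frac{1}{w_m^t}\sum \phi_E$ average):

```python
import numpy as np
import torch

def integrate_points(voxel, encoder, y_points, n_points):
    """Fuse one frame's observations Y_m^t into a PLIVox by a running weighted average."""
    w_t = len(y_points)                              # w_m^t = |Y_m^t|
    if w_t == 0:
        return
    with torch.no_grad():                            # l_m^t: mean of phi_E over Y_m^t
        l_t = encoder(torch.as_tensor(y_points, dtype=torch.float32),
                      torch.as_tensor(n_points, dtype=torch.float32)).numpy()
    voxel.latent = (voxel.latent * voxel.weight + l_t * w_t) / (voxel.weight + w_t)
    voxel.weight += w_t
```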
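And a sketch of the per-PLIVox mesh-extraction query, assuming the decoder from the earlier sketch and marching cubes from scikit-image; the doubled-domain overlap and trilinear blending across neighboring PLIVoxs are omitted for brevity:

```python
import numpy as np
import torch
from skimage.measure import marching_cubes

def extract_voxel_mesh(voxel, decoder, resolution=8, a=0.1):
    """Query the decoder on an equally-spaced grid inside one PLIVox and mesh it."""
    lin = np.linspace(-0.5, 0.5, resolution)                 # local coordinates y
    grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1)
    y = torch.as_tensor(grid.reshape(-1, 3), dtype=torch.float32)
    with torch.no_grad():
        mu, _ = decoder(y, torch.as_tensor(voxel.latent, dtype=torch.float32))
    sdf = mu.numpy().reshape(resolution, resolution, resolution)
    if sdf.min() > 0 or sdf.max() < 0:                       # surface does not cross this voxel
        return None
    verts, faces, _, _ = marching_cubes(sdf, level=0.0,
                                        spacing=(a / (resolution - 1),) * 3)
    verts = verts - a / 2 + voxel.centroid                   # back to world coordinates
    return verts, faces
```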

