VINS-Mono 论文翻译

 


英文题目 VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator
中文名称 VINS-Mono:一种稳健且通用的单目视觉-惯性状态估计器
发表时间 2018年07月27日
平台 IEEE Transactions on Robotics
作者 Tong Qin, Peiliang Li, and Shaojie Shen
来源 香港科技大学空中机器人研究组(HKUST Aerial Robotics Group)
关键词 单目VIO 状态估计 传感器融合 SLAM

摘要

A monocular visual-inertial system (VINS), consisting of a camera and a low-cost inertial measurement unit (IMU), forms the minimum sensor suite for metric six degrees-of-freedom (DOF) state estimation. However, the lack of direct distance measurement poses significant challenges in terms of IMU processing, estimator initialization, extrinsic calibration, and nonlinear optimization. In this work, we present VINS-Mono: a robust and versatile monocular visual-inertial state estimator. Our approach starts with a robust procedure for estimator initialization and failure recovery. A tightly-coupled, nonlinear optimization-based method is used to obtain high accuracy visual-inertial odometry by fusing pre-integrated IMU measurements and feature observations. A loop detection module, in combination with our tightly-coupled formulation, enables relocalization with minimum computation overhead. We additionally perform four degrees-of-freedom pose graph optimization to enforce global consistency. We validate the performance of our system on public datasets and real-world experiments and compare against other state-of-the-art algorithms. We also perform onboard closed-loop autonomous flight on the MAV platform and port the algorithm to an iOS-based demonstration. We highlight that the proposed work is a reliable, complete, and versatile system that is applicable for different applications that require high accuracy localization. We open source our implementations for both PCs \({}^{1}\) and iOS mobile devices \({}^{2}\).

  • 单目视觉-惯性系统(VINS)由一个摄像头和低成本惯性测量单元(IMU)组成,构成了进行度量六自由度(DOF)状态估计的最小传感器套件。
  • 然而,缺乏直接的距离测量,在 IMU 处理、估计器初始化、外参标定和非线性优化方面带来了重大挑战。
  • 在这项工作中,我们提出了 VINS-Mono:一种稳健且通用的单目视觉-惯性状态估计器。
    • 我们的方法从稳健的估计器初始化与故障恢复程序开始。
    • 采用紧耦合的非线性优化方法,通过融合预积分的 IMU 测量和特征观测,获得高精度的视觉-惯性里程计。
    • 结合我们的紧耦合公式,回环检测模块能够以最小的计算开销实现重定位。
    • 我们还执行四自由度位姿图优化,以保证全局一致性。
    • 我们在公开数据集和真实世界实验中验证了系统的性能,并与其他最先进的算法进行了比较。
    • 我们还在 MAV 平台上进行了机载闭环自主飞行,并将算法移植到基于 iOS 的演示应用中。
    • 我们强调,所提出的工作是一个可靠、完整且通用的系统,适用于各类需要高精度定位的应用。
    • 我们开源了我们的实现,既适用于 PC \({}^{1}\),也适用于 iOS 移动设备 \({}^{2}\)。

Index Terms-Monocular visual-inertial systems, state estimation, sensor fusion, simultaneous localization and mapping

索引关键词 - 单目视觉-惯性系统、状态估计、传感器融合、同时定位与建图(SLAM)


\({}^{1}\) https://github.com/HKUST-Aerial-Robotics/VINS-Mono

\({}^{2}\) https://github.com/HKUST-Aerial-Robotics/VINS-Mobile


I. INTRODUCTION

State estimation is undoubtedly the most fundamental module for a wide range of applications, such as robotic navigation, autonomous driving, virtual reality (VR), and augmented reality (AR). Approaches that use only a monocular camera have gained significant interest from the community due to their small size, low cost, and easy hardware setup [1]-[5]. However, monocular vision-only systems are incapable of recovering the metric scale, therefore limiting their usage in real-world robotic applications. Recently, we see a growing trend of assisting the monocular vision system with a low-cost inertial measurement unit (IMU). The primary advantage of this monocular visual-inertial system (VINS) is to have the metric scale, as well as roll and pitch angles, all observable. This enables navigation tasks that require metric state estimates. In addition, the integration of IMU measurements can dramatically improve motion tracking performance by bridging the gap between losses of visual tracks due to illumination change, texture-less area, or motion blur. In fact, the monocular VINS is not only widely available on mobile robots, drones, and mobile devices, it is also the minimum sensor setup for sufficient self and environmental perception.

状态估计无疑是广泛应用于各种场合的最基本模块,如机器人导航、自动驾驶、虚拟现实(VR)和增强现实(AR)。

  • 仅使用单个摄像头的方法因其体积小、成本低、硬件设置简单而受到社区的广泛关注 [1]-[5]。
  • 然而,仅使用单目视觉的系统无法恢复度量尺度,因此限制了它们在现实世界机器人应用中的使用。
  • 最近,我们看到用低成本惯性测量单元(IMU)辅助单目视觉系统的趋势日益增长。这种单目视觉-惯性系统(VINS)的主要优势在于度量尺度以及横滚角和俯仰角都是可观测的。这使得需要度量状态估计的导航任务成为可能。
  • 此外,集成 IMU 测量可以在因光照变化、无纹理区域或运动模糊而丢失视觉跟踪时弥补空缺,从而显著提高运动跟踪性能。
  • 实际上,单目 VINS 不仅广泛存在于移动机器人、无人机和移动设备上,它还是满足充分自我感知和环境感知的最小传感器配置。

(a) 轨迹(蓝色)和特征位置(红色)

(b) 轨迹与Google地图叠加,以便视觉比较

Fig. 1. Outdoor experimental results of the proposed monocular visual-inertial state estimator. Data is collected by a hand-held monocular camera-IMU setup under normal walking condition. It includes two complete circles inside the field and two semicircles on the nearby driveway. Total trajectory length is 2.5 \(\mathrm{{km}}\) . A video of the experiment can be found in the multimedia attachment.

图1. 提出的单目视觉-惯性状态估计器户外实验结果。数据是通过手持单目相机-IMU设备在正常步行条件下收集的。它包括场地内的两个完整圆圈和附近车道上的两个半圆。总轨迹长度为2.5 \(\mathrm{{km}}\)。实验视频可以在多媒体附件中找到。

However, all these advantages come with a price. For monocular VINS, it is well known that acceleration excitation is needed to make metric scale observable. This implies that monocular VINS estimators cannot start from a stationary condition, but rather launch from an unknown moving state. Also recognizing the fact that visual-inertial systems are highly nonlinear, we see significant challenges in terms of estimator initialization. The existence of two sensors also makes camera-IMU extrinsic calibration critical. Finally, in order to eliminate long-term drift within an acceptable processing window, a complete system that includes visual-inertial odometry, loop detection, relocalization, and global optimization has to be developed.

然而,所有这些优势都有代价。

  • 对于单目 VINS 来说,众所周知,需要加速度激励才能使度量尺度可观测。这意味着单目 VINS 估计器不能从静止状态开始,而只能从未知的运动状态启动。
  • 同时,考虑到视觉-惯性系统高度非线性,我们在估计器初始化方面面临重大挑战。
  • 两个传感器的存在也使得相机-IMU 外参标定至关重要。
  • 最后,为了在可接受的处理窗口内消除长期漂移,必须开发一个包括视觉-惯性里程计、回环检测、重定位和全局优化在内的完整系统。

To address all these issues, we propose VINS-Mono, a robust and versatile monocular visual-inertial state estimator. Our solution starts with on-the-fly estimator initialization. The same initialization module is also used for failure recovery. The core of our solution is a robust monocular visual-inertial odometry (VIO) based on tightly-coupled sliding window nonlinear optimization. The monocular VIO module not only provides accurate local pose, velocity, and orientation estimates, it also performs camera-IMU extrinsic calibration and IMU biases correction in an online fashion. Loops are detected using DBoW2 [6]. Relocalization is done in a tightly-coupled setting by feature-level fusion with the monocular VIO. This enables robust and accurate relocalization with minimum computation overhead. Finally, geometrically verified loops are added into a pose graph, and thanks to the observable roll and pitch angles from the monocular VIO, a four degrees-of-freedom (DOF) pose graph optimization is performed to ensure global consistency.

为了解决这些问题,我们提出了 VINS-Mono,这是一种稳健且通用的单目视觉-惯性状态估计器。

  • 我们的解决方案从实时(on-the-fly)估计器初始化开始。同一初始化模块也用于故障恢复。
  • 我们解决方案的核心是基于紧耦合滑动窗口非线性优化的稳健单目视觉-惯性里程计(VIO)。
  • 单目 VIO 模块不仅提供精确的局部位姿、速度和朝向估计,还在线执行相机-IMU 外参标定和 IMU 偏置校正。
  • 回环检测使用 DBoW2 [6]。
  • 重定位是在紧耦合的设置下通过与单目 VIO 的特征级融合完成的,这实现了计算开销最小的稳健、精确重定位。
  • 最后,经过几何验证的回环被添加到位姿图中;得益于单目 VIO 提供的可观测横滚角和俯仰角,执行四自由度(DOF)位姿图优化以确保全局一致性。

VINS-Mono combines and improves our previous works on monocular visual-inertial fusion [7]-[10]. It is built on top of our tightly-coupled, optimization-based formulation for monocular VIO [7], [8], and incorporates the improved initialization procedure introduced in [9]. The first attempt of porting to mobile devices was given in [10]. Further improvements of VINS-Mono comparing to our previous works include improved IMU pre-integration with bias correction, tightly-coupled relocalization, global pose graph optimization, extensive experimental evaluation, and a robust and versatile open source implementation.

VINS-Mono 结合并改进了我们之前关于单目视觉-惯性融合的工作 [7]-[10]。它构建在我们为单目 VIO 提出的紧耦合、基于优化的表述之上 [7], [8],并集成了 [9] 中引入的改进初始化过程。首次尝试移植到移动设备的工作在 [10] 中给出。

与之前工作相比,VINS-Mono 的进一步改进包括

  • 改进的IMU预积分和偏差校正、
  • 紧耦合重定位、
  • 全局位姿图优化、
  • 广泛的实验评估
  • 以及稳健且多功能的开源实现。

The whole system is complete and easy-to-use. It has been successfully applied to small-scale AR scenarios, medium-scale drone navigation, and large-scale state estimation tasks. Superior performance has been shown against other state-of-the-art methods. To this end, we summarize our contributions as follows:

整个系统完整且易于使用。它已经成功应用于小规模增强现实(AR)场景、中等规模无人机导航和大规模状态估计任务。与其他最先进的方法相比,展示了卓越的性能。为此,我们将我们的贡献概括如下:

  • A robust initialization procedure that is able to bootstrap the system from unknown initial states.

  • 一种健壮的初始化过程,能够从未知初始状态启动系统。

  • A tightly-coupled, optimization-based monocular visual-inertial odometry with camera-IMU extrinsic calibration and IMU bias estimation.

  • 一种紧耦合的、基于优化的单目视觉-惯性里程计,具有相机-IMU外参校准和IMU偏差估计。

  • Online loop detection and tightly-coupled relocalization.

  • 在线回环检测和紧耦合的重定位。

  • Four DOF global pose graph optimization.

  • 四自由度全局位姿图优化。

  • Real-time performance demonstration for drone navigation, large-scale localization, and mobile AR applications.

  • 针对无人机导航、大规模定位和移动增强现实应用的实时性能演示。

  • Open-source release for both the PC version that is fully integrated with ROS, as well as the iOS version that runs on iPhone6s or above.

  • 开源发布,包括与ROS完全集成的PC版本,以及在iPhone6s或以上版本上运行的iOS版本。

The rest of the paper is structured as follows. In Sect. II, we discuss relevant literature. We give an overview of the complete system pipeline in Sect. III. Preprocessing steps for both visual and pre-integrated IMU measurements are presented in Sect. IV. In Sect. V, we discuss the estimator initialization procedure. A tightly-coupled, self-calibrating, nonlinear optimization-based monocular VIO is presented in Sect. VI. Tightly-coupled relocalization and global pose graph optimization are presented in Sect. VII and Sect. VIII, respectively. Implementation details and experimental results are shown in Sect. IX. Finally, the paper is concluded with a discussion and possible future research directions in Sect. X.

本文的其余部分结构如下。

  • 在第二节中,我们讨论相关文献。
  • 我们在第三节中概述了完整的系统流程。
  • 第四节介绍了视觉和预积分IMU测量的预处理步骤。
  • 在第五节中,我们讨论了估计器初始化过程。
  • 在第六节中,我们提出了一种紧耦合、自校准、基于非线性优化的单目VIO。
  • 紧耦合的重定位和全局位姿图优化分别在第七节和第八节中介绍。
  • 第九节展示了实现细节和实验结果。
  • 最后,在第十节中,我们对本文进行了讨论,并提出了可能的未来研究方向。

II. RELATED WORK

Scholarly works on monocular vision-based state estimation/odometry/SLAM are extensive. Noticeable approaches include PTAM [1], SVO [2], LSD-SLAM [3], DSO [5], and ORB-SLAM [4]. It is obvious that any attempt to give a full relevant review would be incomplete. In this section, however, we skip the discussion on vision-only approaches, and only focus on the most relevant results on monocular visual-inertial state estimation.

关于基于单目视觉的状态估计/里程计/SLAM的学术研究非常广泛。值得注意的方法包括PTAM [1]、SVO [2]、LSD-SLAM [3]、DSO [5]和ORB-SLAM [4]。显然,任何试图给出全面相关回顾的尝试都将是片面的。然而,在本节中,我们跳过仅基于视觉的方法的讨论,仅关注单目视觉-惯性状态估计的最相关结果。

The simplest way to deal with visual and inertial measurements is loosely-coupled sensor fusion [11], [12], where IMU is treated as an independent module to assist vision-only pose estimates obtained from the visual structure from motion. Fusion is usually done by an extended Kalman filter (EKF), where IMU is used for state propagation and the vision-only pose is used for the update. Further on, tightly-coupled visual-inertial algorithms are either based on the EKF [13]-[15] or graph optimization [7], [8], [16], [17], where camera and IMU measurements are jointly optimized from the raw measurement level. A popular EKF based VIO approach is MSCKF [13], [14]. MSCKF maintains several previous camera poses in the state vector, and uses visual measurements of the same feature across multiple camera views to form multi-constraint update. SR-ISWF [18], [19] is an extension of MSCKF. It uses square-root form [20] to achieve single-precision representation and avoid poor numerical properties. This approach employs the inverse filter for iterative re-linearization, making it equal to optimization-based algorithms. Batch graph optimization or bundle adjustment techniques maintain and optimize all measurements to obtain the optimal state estimates. To achieve constant processing time, popular graph-based VIO methods [8], [16], [17] usually optimize over a bounded-size sliding window of recent states by marginalizing out past states and measurements. Due to high computational demands of iterative solving of nonlinear systems, few graph-based methods can achieve real-time performance on resource-constrained platforms, such as mobile phones.

处理视觉和惯性测量的最简单方法是松耦合传感器融合 [11],[12],在这种方法中,IMU 被视为一个独立模块,用于辅助仅基于视觉的运动结构得到的视觉位姿估计。融合通常通过扩展卡尔曼滤波器(EKF)完成,其中 IMU 用于状态传播,而仅视觉位姿用于更新。

进一步地,紧耦合的视觉惯性算法要么基于 EKF [13]-[15],要么基于图优化 [7],[8],[16],[17],在基于图优化情况下,相机和 IMU 测量值从原始测量级别开始联合优化。

  • 一种流行的基于 EKF 的 VIO 方法是 MSCKF [13],[14]。MSCKF 在状态向量中维护多个之前的相机位姿,并使用同一特征在多个相机视角下的视觉测量来构成多约束更新。
  • SR-ISWF [18],[19] 是 MSCKF 的扩展。它使用平方根形式 [20] 来实现单精度表示并避免较差的数值特性。这种方法采用逆滤波器进行迭代重线性化,使其等同于基于优化的算法。
  • 批量图优化或光束法平差(BA)技术维护并优化所有测量值,以获得最优状态估计。为了实现恒定的处理时间,流行的基于图的 VIO 方法 [8],[16],[17] 通常通过边缘化过去的状态和测量值,在最近状态的有限大小滑动窗口上进行优化。由于迭代求解非线性系统的计算需求较高,在资源受限的平台(如手机)上,很少有基于图的方法能够实现实时性能。

For visual measurement processing, algorithms can be categorized into either direct or indirect method according to the definition of visual residual models. Direct approaches [2], [3], [21] minimize photometric error while indirect approaches [8], [14], [16] minimize geometric displacement. Direct methods require a good initial guess due to their small region of attraction, while indirect approaches consume extra computational resources on extracting and matching features. Indirect approaches are more frequently found in real-world engineering deployment due to their maturity and robustness. However, direct approaches are easier to extend for dense mapping as they operate directly on the pixel level.

对于视觉测量处理,算法可以根据视觉残差模型的定义分为直接法或间接法。

  • 直接法 [2],[3],[21] 最小化光度误差。由于其吸引域较小,直接法需要良好的初始猜测;但由于直接在像素级别上操作,它们更容易扩展到稠密建图。
  • 间接法 [8],[14],[16] 最小化几何位移。间接法在提取和匹配特征上消耗额外的计算资源;由于其成熟性和鲁棒性,它在实际工程部署中更为常见。

In practice, IMUs usually acquire data at a much higher rate than the camera. Different methods have been proposed to handle the high rate IMU measurements. The most straightforward approach is to use the IMU for state propagation in EKF-based approaches [11], [13]. In a graph optimization formulation, an efficient technique called IMU pre-integration is developed in order to avoid repeated IMU re-integration. This technique was first introduced in [22], which parametrized rotation error using Euler angles. An on-manifold rotation formulation for IMU pre-integration was developed in our previous work [7]. This work derived the covariance propagation using continuous-time IMU error state dynamics. However, IMU biases were ignored. The pre-integration theory was further improved in [23] by adding posterior IMU bias correction.

在实际中,IMUs 通常以比相机高得多的频率获取数据。已经提出了不同的方法来处理高频率的 IMU 测量。

  • 最直接的方法是在基于 EKF 的方法 [11],[13] 中使用 IMU 进行状态传播。
  • 在图优化表述中,为了避免重复的 IMU 重新积分,开发了一种称为 IMU 预积分的有效技术。这项技术在 [22] 中首次被提出,它使用欧拉角来参数化旋转误差。
  • 我们之前的工作 [7] 为 IMU 预积分开发了一种流形上的旋转表述。这项工作使用连续时间 IMU 误差状态动力学推导了协方差传播,然而其中忽略了 IMU 偏置。
  • 在 [23] 中,通过添加后验 IMU 偏置校正,进一步改进了预积分理论。

Accurate initial values are crucial to bootstrap any monocular VINS. A linear estimator initialization method that leverages relative rotations from short-term IMU pre-integration was proposed in [8], [24]. However, this method does not model gyroscope bias, and fails to model sensor noise in raw projection equations. In real-world applications, this results in unreliable initialization when visual features are far away from the sensor suite. A closed-form solution to the monocular visual-inertial initialization problem was introduced in [25]. Later, an extension to this closed-form solution by adding gyroscope bias calibration was proposed in [26]. These approaches fail to model the uncertainty in inertial integration since they rely on the double integration of IMU measurements over an extended period of time. In [27], a re-initialization and failure recovery algorithm based on SVO [2] was proposed. This is a practical method based on a loosely-coupled fusion framework. However, an additional downward-facing distance sensor is required to recover the metric scale. An initialization algorithm built on top of the popular ORB-SLAM [4] was introduced in [17]. An initial estimation of the scale, gravity direction, velocity and IMU biases are computed for the visual-inertial full BA given a set of keyframes from ORB-SLAM. However, it is reported that the time required for scale convergence can be longer than 10 seconds. This can pose problems for robotic navigation tasks that require scale estimates right at the beginning.

准确的初始值对于启动任何单目视觉惯性导航系统(VINS)至关重要。

  • 在文献[8],[24]中提出了一种利用 短期IMU预积分得到的相对旋转的线性估计器初始化方法。然而,该方法没有建模陀螺仪偏差,并且未在原始投影方程中建模传感器噪声。在实际应用中,当视觉特征远离传感器套件时,这会导致不可靠的初始化。
  • 文献[25]中介绍了单目视觉惯性初始化问题的闭式解。
  • 后来,文献[26]中通过添加陀螺仪偏差校准对该闭式解进行了扩展。这些方法由于依赖长时间内IMU测量的双重积分,未能建模惯性积分中的不确定性。
  • 在文献[27]中,提出了一种基于 SVO[2] 的重新初始化和故障恢复算法。这是一种基于松耦合融合框架的实用方法。然而,为了恢复度量尺度,需要一个额外的朝下安装的距离传感器。
  • 文献[17]中介绍了一种基于流行的ORB-SLAM[4]的初始化算法。给定ORB-SLAM的一组关键帧,为视觉惯性完整BA计算了尺度、重力方向、速度和IMU偏差的初始估计。然而,据报道,尺度收敛所需的时间可能超过10秒。这对于需要在开始时立即获得尺度估计的机器人导航任务来说可能是个问题。

VIO approaches, regardless of the underlying mathematical formulation that they rely on, suffer from long term drifting in global translation and orientation. To this end, loop closure plays an important role for long-term operations. ORB-SLAM [4] is able to close loops and reuse the map, which takes advantage of Bag-of-Words [6]. A 7 DOF [28] (position, orientation, and scale) pose graph optimization is performed following loop detection. In contrast, for monocular VINS, thanks to the addition of IMU, drift only occurs in 4 DOF, which is the 3D translation, and the rotation around the gravity direction (yaw angle). Therefore, in this paper, we choose to optimize the pose graph with loop constraints in the minimum 4 DOF setting.

无论依赖何种数学表述,VIO 方法都会在全局平移和朝向上产生长期漂移。为此,回环闭合对长期运行起着重要作用。

  • ORB-SLAM [4] 能够闭合回环并重用地图,这得益于 Bag-of-Words [6]。在回环检测之后,进行 7 自由度 [28](位置、朝向和尺度)位姿图优化。
  • 相比之下,对于单目 VINS,由于加入了 IMU,漂移仅发生在 4 个自由度上,即三维平移和绕重力方向的旋转(偏航角)。因此,在本文中,我们选择在最小的 4 自由度设置下,用回环约束优化位姿图。

III. Overview

Fig. 2. A block diagram illustrating the full pipeline of the proposed monocular visual-inertial state estimator.

图 2. 描述所提出单目视觉-惯性状态估计器完整流程的模块图。

The structure of the proposed monocular visual-inertial state estimator is shown in Fig. 2. The system starts with measurement preprocessing (Sect. IV), in which features are extracted and tracked, and IMU measurements between two consecutive frames are pre-integrated. The initialization procedure (Sect. V) provides all necessary values, including pose, velocity, gravity vector, gyroscope bias, and 3D feature location, for bootstrapping the subsequent nonlinear optimization-based VIO. The VIO (Sect. VI) with relocalization (Sect. VII) modules tightly fuses pre-integrated IMU measurements, feature observations, and re-detected features from loop closure. Finally, the pose graph optimization module (Sect. VIII) takes in geometrically verified relocalization results, and performs global optimization to eliminate drift. The VIO, relocalization, and pose graph optimization modules run concurrently in a multi-thread setting. Each module has different running rates and real-time guarantees to ensure reliable operation at all times.

所提出单目视觉-惯性状态估计器的结构如图 2 所示。

  • 系统从测量预处理(第 IV 节)开始,其中提取并跟踪特征,并在连续两帧之间预积分 IMU 测量值。
  • 初始化过程(第 V 节)提供了所有必要值,包括姿态、速度、重力矢量、陀螺仪偏差和 3D 特征位置,以启动后续基于非线性优化的 VIO。
  • VIO(第 VI 节)与重定位(第 VII 节)模块紧密融合 预积分 IMU 测量值、特征观测和回环重检测的特征。
  • 最后,姿态图优化模块(第 VIII 节)接收几何验证的重定位结果,并执行全局优化以消除漂移。
  • VIO、重定位和姿态图优化模块在多线程设置中并行运行。每个模块具有不同的运行速率和实时保证,以确保始终可靠地运行。

We now define notations and frame definitions that we use throughout the paper. We consider \({\left( \cdot \right) }^{w}\) as the world frame. The direction of the gravity is aligned with the \(z\) axis of the world frame. \({\left( \cdot \right) }^{b}\) is the body frame,which we define to be the same as the IMU frame. \({\left( \cdot \right) }^{c}\) is the camera frame. We use both rotation matrices \(\mathbf{R}\) and Hamilton quaternions \(\mathbf{q}\) to represent rotation. We primarily use quaternions in state vectors, but rotation matrices are also used for convenient rotation of 3D vectors. \({\mathbf{q}}_{b}^{w},{\mathbf{p}}_{b}^{w}\) are rotation and translation from the body frame to the world frame. \({b}_{k}\) is the body frame while taking the \({k}^{th}\) image. \({c}_{k}\) is the camera frame while taking the \({k}^{\text{th }}\) image. \(\otimes\) represents the multiplication operation between two quaternions. \({\mathbf{g}}^{w} = {\left\lbrack 0,0,g\right\rbrack }^{T}\) is the gravity vector in the world frame. Finally,we denote \(\widehat{\left( \cdot \right)}\) as the noisy measurement or estimate of a certain quantity.

现在我们定义本文中使用的符号和坐标系定义。

  • 我们将 \({\left( \cdot \right) }^{w}\) 视为世界坐标系。重力的方向与世界坐标系的 \(z\) 轴对齐。
  • \({\left( \cdot \right) }^{b}\)机体坐标系,我们将其定义为与 IMU 坐标系相同。
  • \({\left( \cdot \right) }^{c}\)相机坐标系
  • 我们使用旋转矩阵 \(\mathbf{R}\) 和 Hamilton 四元数 \(\mathbf{q}\) 来表示旋转。我们主要在状态向量中使用四元数,但旋转矩阵也用于方便地旋转三维向量。
  • \({\mathbf{q}}_{b}^{w},{\mathbf{p}}_{b}^{w}\) 表示从机体坐标系到世界坐标系的旋转和平移。
  • \({b}_{k}\) 是在拍摄 \({k}^{th}\) 图像时的机体坐标系。
  • \({c}_{k}\) 是在拍摄 \({k}^{\text{th }}\) 图像时的相机坐标系。
  • \(\otimes\) 表示两个四元数之间的乘法运算
  • \({\mathbf{g}}^{w} = {\left\lbrack 0,0,g\right\rbrack }^{T}\) 是在世界坐标系中的重力向量。
  • 最后,我们用 \(\widehat{\left( \cdot \right)}\) 表示对某一量的噪声测量或估计。
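作为对上述符号约定的补充,下面给出一个最小的 Python 示意(假设四元数采用 Hamilton 约定、按 \([w,x,y,z]\) 顺序存储;函数名为示意所用的自拟名称,并非论文代码),演示四元数乘法 \(\otimes\) 以及用对应的旋转矩阵 \(\mathbf{R}(\mathbf{q})\) 旋转三维向量:

```python
import numpy as np

# 示意代码(非 VINS-Mono 实现):Hamilton 四元数, [w, x, y, z] 顺序

def quat_mul(q, r):
    # q ⊗ r(Hamilton 乘法)
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_to_rot(q):
    # R(q), 满足 R @ v 等价于 q ⊗ [0, v] ⊗ q*
    w, x, y, z = q
    return np.array([[1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
                     [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
                     [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])
```

例如,绕 \(z\) 轴旋转 90°(偏航)的四元数作用在 \([1,0,0]\) 上得到近似 \([0,1,0]\),与对应旋转矩阵一致;这正是正文中"状态向量用四元数、旋转三维向量时用旋转矩阵"的做法。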

IV. Measurement Preprocessing

This section presents preprocessing steps for both inertial and monocular visual measurements. For visual measurements, we track features between consecutive frames and detect new features in the latest frame. For IMU measurements, we pre-integrate them between two consecutive frames. Note that the measurements of the low-cost IMU that we use are affected by both bias and noise. We therefore especially take bias into account in the IMU pre-integration process.

本节介绍了对惯性测量和单目视觉测量的预处理步骤。

  • 对于视觉测量,我们在连续帧之间跟踪特征并在最新帧中检测新特征。
  • 对于 IMU 测量,我们在两个连续帧之间进行预积分。

注意,我们使用的低成本 IMU 的测量受到偏差和噪声的影响。因此,我们在 IMU 预积分过程中特别考虑了偏差

A. Vision Processing Front-end

For each new image, existing features are tracked by the KLT sparse optical flow algorithm [29]. Meanwhile, new corner features are detected [30] to maintain a minimum number (100-300) of features in each image. The detector enforces a uniform feature distribution by setting a minimum separation of pixels between two neighboring features. 2D features are first undistorted and then projected to a unit sphere after passing outlier rejection. Outlier rejection is performed using RANSAC with fundamental matrix model [31].

  • 对于每张新图像,现有的特征通过 KLT稀疏光流算法 [29] 进行跟踪。
  • 同时,检测新的角点特征 [30] 以保持每张图像中特征的最小数量(100-300)。
  • 检测器通过设置两个相邻特征之间的最小像素间隔来强制特征均匀分布。
  • 2D特征 首先去畸变,然后在通过异常值剔除后投影到单位球上
  • 异常值剔除是通过使用 RANSAC 和 基本矩阵模型 [31] 来执行的。
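前端为保持特征均匀分布而设置最小像素间隔的做法,可以用如下最小示意说明(纯 NumPy,非 VINS-Mono 源码;`select_new_features` 及各参数名均为自拟):按检测响应从强到弱,贪心地只接受与已有特征保持最小间距的新角点:

```python
import numpy as np

# 示意代码(非 VINS-Mono 实现):补充新特征时强制最小像素间隔
def select_new_features(candidates, tracked, min_dist=30.0, max_total=150):
    # candidates: Nx2 候选角点,假设已按检测响应从强到弱排序
    # tracked:    Mx2 已由光流跟踪到的特征位置
    kept = list(np.asarray(tracked, dtype=float).reshape(-1, 2))
    added = []
    for p in np.asarray(candidates, dtype=float).reshape(-1, 2):
        if len(kept) >= max_total:
            break
        # 与所有已接受特征的距离都不小于 min_dist 才接受
        if all(np.linalg.norm(p - q) >= min_dist for q in kept):
            kept.append(p)
            added.append(p)
    return np.array(added).reshape(-1, 2)
```

实际实现中这一步通常借助图像掩膜(mask)完成,并与 KLT 跟踪、基本矩阵 RANSAC 外点剔除配合使用。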

Keyframes are also selected in this step. We have two criteria for keyframe selection. The first one is the average parallax apart from the previous keyframe. If the average parallax of tracked features between the current frame and the latest keyframe is beyond a certain threshold, we treat the current frame as a new keyframe. Note that not only translation but also rotation can cause parallax. However, features cannot be triangulated in the rotation-only motion. To avoid this situation, we use short-term integration of gyroscope measurements to compensate rotation when calculating parallax. Note that this rotation compensation is only used for keyframe selection, and is not involved in rotation calculation in the VINS formulation. To this end, even if the gyroscope contains large noise or is biased, it will only result in suboptimal keyframe selection results, and will not directly affect the estimation quality. Another criterion is tracking quality. If the number of tracked features goes below a certain threshold, we treat this frame as a new keyframe. This criterion is to avoid complete loss of feature tracks.

关键帧也在这一步被选中。我们有两个关键帧选择标准。

  • 第一个标准是相对上一关键帧的平均视差。(平均视差准则)
    • 如果跟踪特征在当前帧与最新关键帧之间的平均视差超过某个阈值,我们就将当前帧视为新的关键帧。
    • 请注意,不仅平移,旋转也会导致视差。然而,在纯旋转运动中,特征无法被三角化。为了避免这种情况,我们在计算视差时使用陀螺仪测量的短时积分来补偿旋转。请注意,这种旋转补偿仅用于关键帧选择,并不参与 VINS 表述中的旋转计算。因此,即使陀螺仪噪声较大或存在偏置,也只会导致次优的关键帧选择结果,而不会直接影响估计质量。
  • 另一个标准是跟踪质量。如果跟踪特征的数量低于某个阈值,我们就将这一帧视为新的关键帧。这个标准是为了避免特征轨迹的完全丢失。
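上述两个关键帧准则可以概括为如下 Python 示意(坐标取归一化像平面,阈值数值与函数、参数名均为自拟假设,仅演示判定逻辑):先检查跟踪质量,再计算经陀螺仪旋转补偿后的平均视差:

```python
import numpy as np

# 示意代码(非 VINS-Mono 实现):关键帧判定
def is_keyframe(pts_kf, pts_cur, R_cur_to_kf, parallax_thresh=0.02, min_tracked=20):
    # pts_kf / pts_cur: Nx2 匹配特征在归一化像平面上的坐标
    # R_cur_to_kf: 陀螺仪短时积分得到的旋转,把当前帧视线方向旋回上一关键帧相机系
    pts_kf = np.asarray(pts_kf, dtype=float).reshape(-1, 2)
    pts_cur = np.asarray(pts_cur, dtype=float).reshape(-1, 2)
    if len(pts_cur) < min_tracked:            # 准则 2:跟踪质量
        return True
    # 准则 1:旋转补偿后的平均视差
    bearings = np.hstack([pts_cur, np.ones((len(pts_cur), 1))])
    comp = (R_cur_to_kf @ bearings.T).T
    comp = comp[:, :2] / comp[:, 2:3]         # 投影回归一化像平面
    parallax = np.linalg.norm(comp - pts_kf, axis=1).mean()
    return bool(parallax > parallax_thresh)
```

在纯旋转运动下,补偿后的视差接近零,因而不会仅因旋转而误判为关键帧。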

B. IMU Pre-integration

IMU Pre-integration was first proposed in [22], which parametrized rotation error in Euler angle. An on-manifold rotation formulation for IMU pre-integration was developed in our previous work [7]. This work derived the covariance propagation using continuous-time IMU error state dynamics. However, IMU biases were ignored. The pre-integration theory was further improved in [23] by adding posterior IMU bias correction. In this paper, we extend the IMU pre-integration proposed in our previous work [7] by incorporating IMU bias correction.

  • IMU 预积分 首次在[22]中提出,该文使用欧拉角参数化了旋转误差。
  • 我们在之前的工作[7]中为 IMU预积分 开发了一种在流形上的旋转公式。这项工作推导了使用连续时间IMU误差状态动力学的协方差传播。然而,IMU偏差被忽略了。
  • 在[23]中,通过添加后验IMU偏差校正,进一步改进了预积分理论。
  • 在本文中,我们通过引入 IMU偏差 校正,扩展了我们之前工作[7]中提出的 IMU预积分。

The raw gyroscope and accelerometer measurements from IMU, \(\widehat{\omega }\) and \(\widehat{\mathrm{a}}\) ,are given by:

来自IMU的原始陀螺仪和加速度计测量值\(\widehat{\omega }\)\(\widehat{\mathrm{a}}\),分别为:

\[{\widehat{\mathbf{a}}}_{t} = {\mathbf{a}}_{t} + {\mathbf{b}}_{{a}_{t}} + {\mathbf{R}}_{w}^{t}{\mathbf{g}}^{w} + {\mathbf{n}}_{a} \tag{1} \]

\[{\widehat{\mathbf{\omega }}}_{t} = {\mathbf{\omega }}_{t} + {\mathbf{b}}_{{w}_{t}} + {\mathbf{n}}_{w} \]

IMU measurements, which are measured in the body frame, combines the force for countering gravity and the platform dynamics,and are affected by acceleration bias \({\mathbf{b}}_{a}\) ,gyroscope bias \({\mathbf{b}}_{w}\) ,and additive noise. We assume that the additive noise in acceleration and gyroscope measurements are Gaussian, \({\mathbf{n}}_{a} \sim \mathcal{N}\left( {\mathbf{0},{\mathbf{\sigma }}_{a}^{2}}\right) ,{\mathbf{n}}_{w} \sim \mathcal{N}\left( {\mathbf{0},{\mathbf{\sigma }}_{w}^{2}}\right)\) . Acceleration bias and gyroscope bias are modeled as random walk, whose derivatives are Gaussian, \({\mathbf{n}}_{{b}_{a}} \sim \mathcal{N}\left( {\mathbf{0},{\mathbf{\sigma }}_{{b}_{a}}^{2}}\right) ,{\mathbf{n}}_{{b}_{w}} \sim \mathcal{N}\left( {\mathbf{0},{\mathbf{\sigma }}_{{b}_{w}}^{2}}\right)\) :

  • IMU测量值是在机体坐标系中测量的,它结合了对抗重力的力和平台动力学,并受到加速度偏差\({\mathbf{b}}_{a}\)、陀螺仪偏差\({\mathbf{b}}_{w}\)和加性噪声的影响。
  • 我们假设加速度和陀螺仪测量中的加性噪声是 高斯分布的\({\mathbf{n}}_{a} \sim \mathcal{N}\left( {\mathbf{0},{\mathbf{\sigma }}_{a}^{2}}\right) ,{\mathbf{n}}_{w} \sim \mathcal{N}\left( {\mathbf{0},{\mathbf{\sigma }}_{w}^{2}}\right)\)
  • 加速度偏差和陀螺仪偏差被建模为随机游走,其导数为 高斯分布\({\mathbf{n}}_{{b}_{a}} \sim \mathcal{N}\left( {\mathbf{0},{\mathbf{\sigma }}_{{b}_{a}}^{2}}\right) ,{\mathbf{n}}_{{b}_{w}} \sim \mathcal{N}\left( {\mathbf{0},{\mathbf{\sigma }}_{{b}_{w}}^{2}}\right)\)

\[{\dot{\mathbf{b}}}_{{a}_{t}} = {\mathbf{n}}_{{b}_{a}},\;{\dot{\mathbf{b}}}_{{w}_{t}} = {\mathbf{n}}_{{b}_{w}}. \tag{2} \]
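式 (1)、(2) 的测量模型可以用一个最小的 Python 示意来模拟(函数与参数名为自拟;假设加性噪声为零均值高斯,偏置随机游走用欧拉离散):

```python
import numpy as np

# 示意代码(非 VINS-Mono 实现):按式 (1)、(2) 模拟 IMU 读数
def imu_measure(a_b, omega_b, R_b_from_w, b_a, b_w, sigma_a, sigma_w, g_w, rng):
    # a_b / omega_b: 机体系下的真实加速度与角速度
    # 读数 = 真值 + 偏置 + 重力项 R_w^t g^w + 高斯噪声
    n_a = rng.normal(0.0, sigma_a, 3)
    n_w = rng.normal(0.0, sigma_w, 3)
    a_hat = a_b + b_a + R_b_from_w @ g_w + n_a
    w_hat = omega_b + b_w + n_w
    return a_hat, w_hat

def walk_bias(b, sigma_b, dt, rng):
    # 式 (2):偏置按随机游走演化(欧拉离散的一步)
    return b + rng.normal(0.0, sigma_b, 3) * np.sqrt(dt)
```

静止、无偏置且无噪声时,加速度计读数恰为 \({\mathbf{g}}^{w}\) 在机体系下的投影(约 9.8 m/s²),这正是此测量模型中重力项的含义。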

Given two time instants that correspond to image frames \({b}_{k}\) and \({b}_{k + 1}\) ,position,velocity,and orientation states can be propagated by inertial measurements during time interval \(\left\lbrack {{t}_{k},{t}_{k + 1}}\right\rbrack\) in the world frame:

给定对应于图像帧\({b}_{k}\)\({b}_{k + 1}\)的两个时间点,位置、速度和方向状态可以通过在时间间隔\(\left\lbrack {{t}_{k},{t}_{k + 1}}\right\rbrack\)内的惯性测量在世界坐标系中传播:

\[{\mathbf{p}}_{{b}_{k + 1}}^{w} = {\mathbf{p}}_{{b}_{k}}^{w} + {\mathbf{v}}_{{b}_{k}}^{w}\Delta {t}_{k} \]

\[+ {\iint }_{t \in \left\lbrack {{t}_{k},{t}_{k + 1}}\right\rbrack }\left( {{\mathbf{R}}_{t}^{w}\left( {{\widehat{\mathbf{a}}}_{t} - {\mathbf{b}}_{{a}_{t}} - {\mathbf{n}}_{a}}\right) - {\mathbf{g}}^{w}}\right) d{t}^{2} \]

\[{\mathbf{v}}_{{b}_{k + 1}}^{w} = {\mathbf{v}}_{{b}_{k}}^{w} + {\int }_{t \in \left\lbrack {{t}_{k},{t}_{k + 1}}\right\rbrack }\left( {{\mathbf{R}}_{t}^{w}\left( {{\widehat{\mathbf{a}}}_{t} - {\mathbf{b}}_{{a}_{t}} - {\mathbf{n}}_{a}}\right) - {\mathbf{g}}^{w}}\right) {dt} \]

\[{\mathbf{q}}_{{b}_{k + 1}}^{w} = {\mathbf{q}}_{{b}_{k}}^{w} \otimes {\int }_{t \in \left\lbrack {{t}_{k},{t}_{k + 1}}\right\rbrack }\frac{1}{2}\mathbf{\Omega }\left( {{\widehat{\mathbf{\omega }}}_{t} - {\mathbf{b}}_{{w}_{t}} - {\mathbf{n}}_{w}}\right) {\mathbf{q}}_{t}^{{b}_{k}}{dt} \tag{3} \]

where

其中

\[\mathbf{\Omega }\left( \mathbf{\omega }\right) = \left\lbrack \begin{matrix} - \lfloor \mathbf{\omega }{\rfloor }_{ \times } & \mathbf{\omega } \\ - {\mathbf{\omega }}^{T} & 0 \end{matrix}\right\rbrack ,\lfloor \mathbf{\omega }{\rfloor }_{ \times } = \left\lbrack \begin{matrix} 0 & - {\omega }_{z} & {\omega }_{y} \\ {\omega }_{z} & 0 & - {\omega }_{x} \\ - {\omega }_{y} & {\omega }_{x} & 0 \end{matrix}\right\rbrack . \tag{4} \]

\(\Delta {t}_{k}\) is the duration of the time interval \(\left\lbrack {{t}_{k},{t}_{k + 1}}\right\rbrack\) .

\(\Delta {t}_{k}\)是时间间隔\(\left\lbrack {{t}_{k},{t}_{k + 1}}\right\rbrack\)的持续时间。

It can be seen that the IMU state propagation requires rotation, position, and velocity of frame \({b}_{k}\) . When these starting states change, we need to re-propagate IMU measurements. Especially in the optimization-based algorithm, every time we adjust poses, we will need to re-propagate IMU measurements between them. This propagation strategy is computationally demanding. To avoid re-propagation, we adopt the pre-integration algorithm.

可以看出,IMU 状态传播需要帧 \({b}_{k}\) 的旋转、位置和速度。当这些起始状态发生变化时,我们需要重新传播 IMU 测量值。特别是在基于优化的算法中,每次调整位姿时,都需要重新传播位姿之间的 IMU 测量值。这种传播策略在计算上开销很大。为了避免重新传播,我们采用预积分算法。
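式 (3) 的世界系传播可以写成如下欧拉离散的 Python 示意(四元数取 Hamilton 约定、\([w,x,y,z]\) 顺序;一阶四元数增量近似式 (4) 中 \(\frac{1}{2}\mathbf{\Omega}(\cdot)\) 的连续动力学;函数名为自拟,仅为示意):

```python
import numpy as np

# 示意代码(非 VINS-Mono 实现):式 (3) 的欧拉积分
def quat_mul(q, r):
    w1, x1, y1, z1 = q; w2, x2, y2, z2 = r
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_to_rot(q):
    w, x, y, z = q
    return np.array([[1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
                     [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
                     [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def propagate_world(p, v, q, samples, dt, b_a, b_w, g_w):
    # samples: (a_hat, w_hat) 序列;噪声项按零处理
    for a_hat, w_hat in samples:
        a_w = quat_to_rot(q) @ (np.asarray(a_hat) - b_a) - g_w   # 世界系加速度
        p = p + v * dt + 0.5 * a_w * dt * dt
        v = v + a_w * dt
        dq = np.concatenate([[1.0], 0.5 * (np.asarray(w_hat) - b_w) * dt])
        q = quat_mul(q, dq)
        q = q / np.linalg.norm(q)                                # 保持单位四元数
    return p, v, q
```

可以看到积分从帧 \({b}_{k}\) 的状态 \((p, v, q)\) 出发:一旦该起始状态在优化中被调整,整段积分就必须重算,这正是引入预积分的动机。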

After changing the reference frame from the world frame to the local frame \({b}_{k}\) ,we can pre-integrate only the parts that are related to linear acceleration \(\widehat{\mathbf{a}}\) and angular velocity \(\widehat{\mathbf{\omega }}\) as follows:

在将参考系从世界坐标系变换到本地坐标系 \({b}_{k}\) 之后,我们可以仅对与线性加速度 \(\widehat{\mathbf{a}}\) 和角速度 \(\widehat{\mathbf{\omega }}\) 相关的部分进行预积分,如下所示:

\[{\mathbf{R}}_{w}^{{b}_{k}}{\mathbf{p}}_{{b}_{k + 1}}^{w} = {\mathbf{R}}_{w}^{{b}_{k}}\left( {{\mathbf{p}}_{{b}_{k}}^{w} + {\mathbf{v}}_{{b}_{k}}^{w}\Delta {t}_{k} - \frac{1}{2}{\mathbf{g}}^{w}\Delta {t}_{k}^{2}}\right) + {\mathbf{\alpha }}_{{b}_{k + 1}}^{{b}_{k}} \]

\[{\mathbf{R}}_{w}^{{b}_{k}}{\mathbf{v}}_{{b}_{k + 1}}^{w} = {\mathbf{R}}_{w}^{{b}_{k}}\left( {{\mathbf{v}}_{{b}_{k}}^{w} - {\mathbf{g}}^{w}\Delta {t}_{k}}\right) + {\mathbf{\beta }}_{{b}_{k + 1}}^{{b}_{k}} \]

\[{\mathbf{q}}_{w}^{{b}_{k}} \otimes {\mathbf{q}}_{{b}_{k + 1}}^{w} = {\gamma }_{{b}_{k + 1}}^{{b}_{k}}, \tag{5} \]

Fig. 3. An illustration of the sliding window monocular VIO with relocalization. It is a tightly-coupled formulation with IMU, visual, and loop measurements.

图 3. 带重定位的滑动窗口单目视觉惯性里程计(VIO)示意图。它是融合 IMU、视觉和回环测量的紧耦合表述。

where,

其中,

\[{\mathbf{\alpha }}_{{b}_{k + 1}}^{{b}_{k}} = {\iint }_{t \in \left\lbrack {{t}_{k},{t}_{k + 1}}\right\rbrack }{\mathbf{R}}_{t}^{{b}_{k}}\left( {{\widehat{\mathbf{a}}}_{t} - {\mathbf{b}}_{{a}_{t}} - {\mathbf{n}}_{a}}\right) d{t}^{2} \]

\[{\mathbf{\beta }}_{{b}_{k + 1}}^{{b}_{k}} = {\int }_{t \in \left\lbrack {{t}_{k},{t}_{k + 1}}\right\rbrack }{\mathbf{R}}_{t}^{{b}_{k}}\left( {{\widehat{\mathbf{a}}}_{t} - {\mathbf{b}}_{{a}_{t}} - {\mathbf{n}}_{a}}\right) {dt} \tag{6} \]

\[{\mathbf{\gamma }}_{{b}_{k + 1}}^{{b}_{k}} = {\int }_{t \in \left\lbrack {{t}_{k},{t}_{k + 1}}\right\rbrack }\frac{1}{2}\mathbf{\Omega }\left( {{\widehat{\mathbf{\omega }}}_{t} - {\mathbf{b}}_{{w}_{t}} - {\mathbf{n}}_{w}}\right) {\mathbf{\gamma }}_{t}^{{b}_{k}}{dt}. \]

It can be seen that the pre-integration terms in (6) can be obtained solely with IMU measurements by taking \({b}_{k}\) as the reference frame. \({\mathbf{\alpha }}_{{b}_{k + 1}}^{{b}_{k}},{\mathbf{\beta }}_{{b}_{k + 1}}^{{b}_{k}},{\mathbf{\gamma }}_{{b}_{k + 1}}^{{b}_{k}}\) are only related to IMU biases instead of other states in \({b}_{k}\) and \({b}_{k + 1}\) . When the estimation of bias changes, if the change is small, we adjust \({\mathbf{\alpha }}_{{b}_{k + 1}}^{{b}_{k}},{\mathbf{\beta }}_{{b}_{k + 1}}^{{b}_{k}},{\mathbf{\gamma }}_{{b}_{k + 1}}^{{b}_{k}}\) by their first-order approximations with respect to the bias, otherwise we do re-propagation. This strategy saves a significant amount of computational resources for optimization-based algorithms, since we do not need to propagate IMU measurements repeatedly.

可以看到,预积分项(6)可以仅通过 IMU 测量获得,以 \({b}_{k}\) 作为参考系。\({\mathbf{\alpha }}_{{b}_{k + 1}}^{{b}_{k}},{\mathbf{\beta }}_{{b}_{k + 1}}^{{b}_{k}},{\mathbf{\gamma }}_{{b}_{k + 1}}^{{b}_{k}}\) 仅与 IMU 偏差相关,而与 \({b}_{k}\)\({b}_{k + 1}\) 中的其他状态无关。当偏差的估计发生变化时,如果变化较小,我们通过相对于偏差的一阶近似来调整 \({\mathbf{\alpha }}_{{b}_{k + 1}}^{{b}_{k}},{\mathbf{\beta }}_{{b}_{k + 1}}^{{b}_{k}},{\mathbf{\gamma }}_{{b}_{k + 1}}^{{b}_{k}}\),否则进行重新传播。这种策略为基于优化的算法节省了大量计算资源,因为我们不需要重复传播 IMU 测量。

For discrete-time implementation, different numerical integration methods such as Euler, mid-point, RK4 integration can be applied. Here Euler integration is chosen to demonstrate the procedure for easy understanding (we use mid-point integration in the implementation code).

对于离散时间实现,可以应用不同的数值积分方法,如欧拉法、中点法、RK4 积分。这里选择欧拉积分来演示易于理解的过程(我们在实现代码中使用中点积分)。

At the beginning, \({\mathbf{\alpha }}_{{b}_{k}}^{{b}_{k}}\) and \({\mathbf{\beta }}_{{b}_{k}}^{{b}_{k}}\) are \(\mathbf{0}\), and \({\mathbf{\gamma }}_{{b}_{k}}^{{b}_{k}}\) is the identity quaternion. The means of \(\mathbf{\alpha },\mathbf{\beta },\mathbf{\gamma }\) in (6) are propagated step by step as follows. Note that the additive noise terms \({\mathbf{n}}_{a},{\mathbf{n}}_{w}\) are unknown and are treated as zero in the implementation. This results in estimated values of the pre-integration terms, marked by \(\widehat{\left( \cdot \right)}\):

在开始时,\({\mathbf{\alpha }}_{{b}_{k}}^{{b}_{k}}\)\({\mathbf{\beta }}_{{b}_{k}}^{{b}_{k}}\)\(\mathbf{0}\)\({\mathbf{\gamma }}_{{b}_{k}}^{{b}_{k}}\) 是单位四元数。方程(6)中 \(\mathbf{\alpha },\mathbf{\beta },\mathbf{\gamma }\) 的均值按如下方式逐步传播。注意,加性噪声项 \({\mathbf{n}}_{a},{\mathbf{n}}_{w}\) 是未知的,在实现中将其视为零。这就得到了预积分项的估计值,标记为 \(\widehat{\left( \cdot \right)}\)

\[{\widehat{\mathbf{\alpha }}}_{i + 1}^{{b}_{k}} = {\widehat{\mathbf{\alpha }}}_{i}^{{b}_{k}} + {\widehat{\mathbf{\beta }}}_{i}^{{b}_{k}}{\delta t} + \frac{1}{2}\mathbf{R}\left( {\widehat{\mathbf{\gamma }}}_{i}^{{b}_{k}}\right) \left( {{\widehat{\mathbf{a}}}_{i} - {\mathbf{b}}_{{a}_{i}}}\right) \delta {t}^{2} \]

\[{\widehat{\mathbf{\beta }}}_{i + 1}^{{b}_{k}} = {\widehat{\mathbf{\beta }}}_{i}^{{b}_{k}} + \mathbf{R}\left( {\widehat{\mathbf{\gamma }}}_{i}^{{b}_{k}}\right) \left( {{\widehat{\mathbf{a}}}_{i} - {\mathbf{b}}_{{a}_{i}}}\right) {\delta t} \tag{7} \]

\[{\widehat{\mathbf{\gamma }}}_{i + 1}^{{b}_{k}} = {\widehat{\mathbf{\gamma }}}_{i}^{{b}_{k}} \otimes \left\lbrack \begin{matrix} 1 \\ \frac{1}{2}\left( {{\widehat{\mathbf{\omega }}}_{i} - {\mathbf{b}}_{{w}_{i}}}\right) {\delta t} \end{matrix}\right\rbrack \]

\(i\) is the discrete moment corresponding to an IMU measurement within \(\left\lbrack {{t}_{k},{t}_{k + 1}}\right\rbrack\). \({\delta t}\) is the time interval between two IMU measurements \(i\) and \(i + 1\).

\(i\) 是对应于 \(\left\lbrack {{t}_{k},{t}_{k + 1}}\right\rbrack\) 内一次 IMU 测量的离散时刻。\({\delta t}\) 是两次 IMU 测量 \(i\)\(i + 1\) 之间的时间间隔。
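The Euler propagation of the means in (7) can be sketched as follows (a minimal numpy sketch; the quaternion helpers and function names are ours, not from the released implementation, and the quaternion convention is \([w, x, y, z]\)):

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a unit quaternion q = [w, x, y, z]."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def quat_mul(p, q):
    """Hamilton product p ⊗ q, both [w, x, y, z]."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([
        pw*qw - px*qx - py*qy - pz*qz,
        pw*qx + px*qw + py*qz - pz*qy,
        pw*qy - px*qz + py*qw + pz*qx,
        pw*qz + px*qy - py*qx + pz*qw,
    ])

def euler_preintegrate_step(alpha, beta, gamma, a_hat, w_hat, ba, bw, dt):
    """One Euler step of the pre-integration means in (7)."""
    R = quat_to_rot(gamma)
    a = a_hat - ba                              # bias-corrected acceleration
    alpha_next = alpha + beta * dt + 0.5 * R @ a * dt**2
    beta_next = beta + R @ a * dt
    dq = np.concatenate(([1.0], 0.5 * (w_hat - bw) * dt))
    gamma_next = quat_mul(gamma, dq)
    gamma_next /= np.linalg.norm(gamma_next)    # keep unit norm
    return alpha_next, beta_next, gamma_next
```

With a constant 1 m/s² acceleration along x, zero rotation, and zero biases, integrating over 1 s yields \(\beta \approx 1\) and \(\alpha \approx 0.5\), as expected from the double integral.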

Then we deal with the covariance propagation. Since the four-dimensional rotation quaternion \({\gamma }_{t}^{{b}_{k}}\) is over-parameterized, we define its error term as a perturbation around its mean:

然后我们处理协方差传播。由于四维旋转四元数 \({\gamma }_{t}^{{b}_{k}}\) 是过参数化的,我们定义其误差项为围绕其均值的扰动:

\[{\mathbf{\gamma }}_{t}^{{b}_{k}} \approx {\widehat{\mathbf{\gamma }}}_{t}^{{b}_{k}} \otimes \left\lbrack \begin{matrix} 1 \\ \frac{1}{2}\delta {\mathbf{\theta }}_{t}^{{b}_{k}} \end{matrix}\right\rbrack \tag{8} \]

where \(\delta {\mathbf{\theta }}_{t}^{{b}_{k}}\) is three-dimensional small perturbation.

其中 \(\delta {\mathbf{\theta }}_{t}^{{b}_{k}}\) 是三维小扰动。

We can derive the continuous-time linearized dynamics of the error terms of (6):

我们可以推导出(6)中误差项的连续时间线性化动力学:

\[\left\lbrack \begin{matrix} \delta {\dot{\mathbf{\alpha }}}_{t}^{{b}_{k}} \\ \delta {\dot{\mathbf{\beta }}}_{t}^{{b}_{k}} \\ \delta {\dot{\mathbf{\theta }}}_{t}^{{b}_{k}} \\ \delta {\dot{\mathbf{b}}}_{{a}_{t}} \\ \delta {\dot{\mathbf{b}}}_{{w}_{t}} \end{matrix}\right\rbrack = \left\lbrack \begin{matrix} 0 & \mathbf{I} & 0 & 0 & 0 \\ 0 & 0 & - {\mathbf{R}}_{t}^{{b}_{k}}\lfloor {\widehat{\mathbf{a}}}_{t} - {\mathbf{b}}_{{a}_{t}}{\rfloor }_{ \times } & - {\mathbf{R}}_{t}^{{b}_{k}} & 0 \\ 0 & 0 & - \lfloor {\widehat{\mathbf{\omega }}}_{t} - {\mathbf{b}}_{{w}_{t}}{\rfloor }_{ \times } & 0 & - \mathbf{I} \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{matrix}\right\rbrack \left\lbrack \begin{matrix} \delta {\mathbf{\alpha }}_{t}^{{b}_{k}} \\ \delta {\mathbf{\beta }}_{t}^{{b}_{k}} \\ \delta {\mathbf{\theta }}_{t}^{{b}_{k}} \\ \delta {\mathbf{b}}_{{a}_{t}} \\ \delta {\mathbf{b}}_{{w}_{t}} \end{matrix}\right\rbrack \]

\[+ \left\lbrack \begin{matrix} 0 & 0 & 0 & 0 \\ - {\mathbf{R}}_{t}^{{b}_{k}} & 0 & 0 & 0 \\ 0 & - \mathbf{I} & 0 & 0 \\ 0 & 0 & \mathbf{I} & 0 \\ 0 & 0 & 0 & \mathbf{I} \end{matrix}\right\rbrack \left\lbrack \begin{matrix} {\mathbf{n}}_{a} \\ {\mathbf{n}}_{w} \\ {\mathbf{n}}_{{b}_{a}} \\ {\mathbf{n}}_{{b}_{w}} \end{matrix}\right\rbrack = {\mathbf{F}}_{t}\delta {\mathbf{z}}_{t}^{{b}_{k}} + {\mathbf{G}}_{t}{\mathbf{n}}_{t} \tag{9} \]

\({\mathbf{P}}_{{b}_{k + 1}}^{{b}_{k}}\) can be computed recursively by first-order discrete-time covariance update with the initial covariance \({\mathbf{P}}_{{b}_{k}}^{{b}_{k}} = \mathbf{0}\) :

\({\mathbf{P}}_{{b}_{k + 1}}^{{b}_{k}}\) 可以通过一阶离散时间协方差更新递归计算,初始协方差为 \({\mathbf{P}}_{{b}_{k}}^{{b}_{k}} = \mathbf{0}\)

\[{\mathbf{P}}_{t + {\delta t}}^{{b}_{k}} = \left( {\mathbf{I} + {\mathbf{F}}_{t}{\delta t}}\right) {\mathbf{P}}_{t}^{{b}_{k}}{\left( \mathbf{I} + {\mathbf{F}}_{t}\delta t\right) }^{T} + \left( {{\mathbf{G}}_{t}{\delta t}}\right) \mathbf{Q}{\left( {\mathbf{G}}_{t}\delta t\right) }^{T}, \]

\[t \in \left\lbrack {k,k + 1}\right\rbrack \tag{10} \]

where \(\mathbf{Q}\) is the diagonal covariance matrix of noise \(\left( {{\mathbf{\sigma }}_{a}^{2},{\mathbf{\sigma }}_{w}^{2},{\mathbf{\sigma }}_{{b}_{a}}^{2},{\mathbf{\sigma }}_{{b}_{w}}^{2}}\right)\) .

其中 \(\mathbf{Q}\) 是噪声 \(\left( {{\mathbf{\sigma }}_{a}^{2},{\mathbf{\sigma }}_{w}^{2},{\mathbf{\sigma }}_{{b}_{a}}^{2},{\mathbf{\sigma }}_{{b}_{w}}^{2}}\right)\) 的对角协方差矩阵。

Meanwhile, the first-order Jacobian matrix \({\mathbf{J}}_{{b}_{k + 1}}\) of \(\delta {\mathbf{z}}_{{b}_{k + 1}}^{{b}_{k}}\) with respect to \(\delta {\mathbf{z}}_{{b}_{k}}^{{b}_{k}}\) can also be computed recursively with the initial Jacobian \({\mathbf{J}}_{{b}_{k}} = \mathbf{I}\),

同时,\(\delta {\mathbf{z}}_{{b}_{k + 1}}^{{b}_{k}}\) 关于 \(\delta {\mathbf{z}}_{{b}_{k}}^{{b}_{k}}\) 的一阶雅可比矩阵 \({\mathbf{J}}_{{b}_{k + 1}}\) 也可以通过初始雅可比矩阵 \({\mathbf{J}}_{{b}_{k}} = \mathbf{I}\) 递归计算,

\[{\mathbf{J}}_{t + {\delta t}} = \left( {\mathbf{I} + {\mathbf{F}}_{t}{\delta t}}\right) {\mathbf{J}}_{t},\;t \in \left\lbrack {k,k + 1}\right\rbrack . \tag{11} \]
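The recursions (10) and (11) share the same first-order transition matrix \(\mathbf{I} + {\mathbf{F}}_{t}{\delta t}\). The sketch below builds \({\mathbf{F}}_{t}\) and \({\mathbf{G}}_{t}\) from (9) with the error-state ordering \([\delta\mathbf{\alpha}, \delta\mathbf{\beta}, \delta\mathbf{\theta}, \delta\mathbf{b}_a, \delta\mathbf{b}_w]\) and performs one discrete update (a minimal numpy sketch under our own naming, not the released implementation):

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix so that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def error_state_F_G(R, a_corr, w_corr):
    """Continuous-time F_t (15x15) and G_t (15x12) of the linearized
    error dynamics (9); a_corr, w_corr are bias-corrected measurements."""
    I3 = np.eye(3)
    F = np.zeros((15, 15))
    F[0:3, 3:6] = I3                        # d(δα)/dt = δβ
    F[3:6, 6:9] = -R @ skew(a_corr)         # d(δβ)/dt row
    F[3:6, 9:12] = -R
    F[6:9, 6:9] = -skew(w_corr)             # d(δθ)/dt row
    F[6:9, 12:15] = -I3
    G = np.zeros((15, 12))
    G[3:6, 0:3] = -R                        # accelerometer noise
    G[6:9, 3:6] = -I3                       # gyroscope noise
    G[9:12, 6:9] = I3                       # bias random walks
    G[12:15, 9:12] = I3
    return F, G

def propagate_cov_jac(P, J, F, G, Q, dt):
    """One step of the covariance update (10) and Jacobian recursion (11)."""
    A = np.eye(15) + F * dt
    P_next = A @ P @ A.T + (G * dt) @ Q @ (G * dt).T
    return P_next, A @ J
```

Starting from \({\mathbf{P}}_{{b}_{k}}^{{b}_{k}} = \mathbf{0}\) and \({\mathbf{J}}_{{b}_{k}} = \mathbf{I}\), repeated calls over all IMU samples in \([t_k, t_{k+1}]\) accumulate the covariance and the bias Jacobian.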

Using this recursive formulation, we obtain the covariance matrix \({\mathbf{P}}_{{b}_{k + 1}}^{{b}_{k}}\) and the Jacobian \({\mathbf{J}}_{{b}_{k + 1}}\). The first-order approximations of \({\mathbf{\alpha }}_{{b}_{k + 1}}^{{b}_{k}},{\mathbf{\beta }}_{{b}_{k + 1}}^{{b}_{k}},{\mathbf{\gamma }}_{{b}_{k + 1}}^{{b}_{k}}\) with respect to the biases can be written as:

使用这种递归公式,我们得到协方差矩阵 \({\mathbf{P}}_{{b}_{k + 1}}^{{b}_{k}}\) 和雅可比矩阵 \({\mathbf{J}}_{{b}_{k + 1}}\)。关于偏差的 \({\mathbf{\alpha }}_{{b}_{k + 1}}^{{b}_{k}},{\mathbf{\beta }}_{{b}_{k + 1}}^{{b}_{k}},{\mathbf{\gamma }}_{{b}_{k + 1}}^{{b}_{k}}\) 的一阶近似可以写成:

\[{\mathbf{\alpha }}_{{b}_{k + 1}}^{{b}_{k}} \approx {\widehat{\mathbf{\alpha }}}_{{b}_{k + 1}}^{{b}_{k}} + {\mathbf{J}}_{{b}_{a}}^{\alpha }\delta {\mathbf{b}}_{{a}_{k}} + {\mathbf{J}}_{{b}_{w}}^{\alpha }\delta {\mathbf{b}}_{{w}_{k}} \]

\[{\mathbf{\beta }}_{{b}_{k + 1}}^{{b}_{k}} \approx {\widehat{\mathbf{\beta }}}_{{b}_{k + 1}}^{{b}_{k}} + {\mathbf{J}}_{{b}_{a}}^{\beta }\delta {\mathbf{b}}_{{a}_{k}} + {\mathbf{J}}_{{b}_{w}}^{\beta }\delta {\mathbf{b}}_{{w}_{k}} \tag{12} \]

\[{\mathbf{\gamma }}_{{b}_{k + 1}}^{{b}_{k}} \approx {\widehat{\mathbf{\gamma }}}_{{b}_{k + 1}}^{{b}_{k}} \otimes \left\lbrack \begin{matrix} 1 \\ \frac{1}{2}{\mathbf{J}}_{{b}_{w}}^{\gamma }\delta {\mathbf{b}}_{{w}_{k}} \end{matrix}\right\rbrack \]

where \({\mathbf{J}}_{{b}_{a}}^{\alpha }\) is the sub-block matrix in \({\mathbf{J}}_{{b}_{k + 1}}\) whose location corresponds to \(\frac{\delta {\mathbf{\alpha }}_{{b}_{k + 1}}^{{b}_{k}}}{\delta {\mathbf{b}}_{{a}_{k}}}\). The same notation applies to \({\mathbf{J}}_{{b}_{w}}^{\alpha },{\mathbf{J}}_{{b}_{a}}^{\beta },{\mathbf{J}}_{{b}_{w}}^{\beta },{\mathbf{J}}_{{b}_{w}}^{\gamma }\). When the bias estimates change slightly, we use (12) to correct the pre-integration results approximately instead of re-propagating.

其中 \({\mathbf{J}}_{{b}_{a}}^{\alpha }\)\({\mathbf{J}}_{{b}_{k + 1}}\) 中的子块矩阵,其位置对应于 \(\frac{\delta {\mathbf{\alpha }}_{{b}_{k + 1}}^{{b}_{k}}}{\delta {\mathbf{b}}_{{a}_{k}}}\)。对 \({\mathbf{J}}_{{b}_{w}}^{\alpha },{\mathbf{J}}_{{b}_{a}}^{\beta },{\mathbf{J}}_{{b}_{w}}^{\beta },{\mathbf{J}}_{{b}_{w}}^{\gamma }\) 也采用相同的记法。当偏差估计变化较小时,我们使用 (12) 近似校正预积分结果,而无需重新传播。
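The bias correction (12) reduces to reading sub-blocks out of the accumulated Jacobian and applying them linearly. A minimal sketch for the \(\mathbf{\alpha}\) and \(\mathbf{\beta}\) terms (the \(\mathbf{\gamma}\) term is analogous via a small-angle quaternion; names and the block layout \([\delta\mathbf{\alpha}, \delta\mathbf{\beta}, \delta\mathbf{\theta}, \delta\mathbf{b}_a, \delta\mathbf{b}_w]\) are our assumptions):

```python
import numpy as np

def correct_preintegration(alpha_hat, beta_hat, J, dba, dbw):
    """First-order correction (12) of the pre-integrated alpha and beta
    for small bias updates dba, dbw, using sub-blocks of the 15x15
    accumulated Jacobian J (error-state order [da, db, dth, dba, dbw])."""
    J_a_ba = J[0:3, 9:12]     # corresponds to d(delta alpha)/d(delta b_a)
    J_a_bw = J[0:3, 12:15]    # d(delta alpha)/d(delta b_w)
    J_b_ba = J[3:6, 9:12]     # d(delta beta)/d(delta b_a)
    J_b_bw = J[3:6, 12:15]    # d(delta beta)/d(delta b_w)
    alpha = alpha_hat + J_a_ba @ dba + J_a_bw @ dbw
    beta = beta_hat + J_b_ba @ dba + J_b_bw @ dbw
    return alpha, beta
```

If the bias change exceeds a threshold, re-propagation should be preferred over this linear correction, as the text above notes.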

Fig. 4. An illustration of the visual-inertial alignment process for estimator initialization.

图 4. 视觉-惯性对准过程用于估计器初始化的示意图。

Now we are able to write down the IMU measurement model with its corresponding covariance \({\mathbf{P}}_{{b}_{k + 1}}^{{b}_{k}}\) :

现在我们能够写出 IMU 测量模型及其相应的协方差 \({\mathbf{P}}_{{b}_{k + 1}}^{{b}_{k}}\)

\[\left\lbrack \begin{matrix} {\widehat{\mathbf{\alpha }}}_{{b}_{k + 1}}^{{b}_{k}} \\ {\widehat{\mathbf{\beta }}}_{{b}_{k + 1}}^{{b}_{k}} \\ {\widehat{\mathbf{\gamma }}}_{{b}_{k + 1}}^{{b}_{k}} \\ \mathbf{0} \\ \mathbf{0} \end{matrix}\right\rbrack = \left\lbrack \begin{matrix} {\mathbf{R}}_{w}^{{b}_{k}}\left( {{\mathbf{p}}_{{b}_{k + 1}}^{w} - {\mathbf{p}}_{{b}_{k}}^{w} + \frac{1}{2}{\mathbf{g}}^{w}\Delta {t}_{k}^{2} - {\mathbf{v}}_{{b}_{k}}^{w}\Delta {t}_{k}}\right) \\ {\mathbf{R}}_{w}^{{b}_{k}}\left( {{\mathbf{v}}_{{b}_{k + 1}}^{w} + {\mathbf{g}}^{w}\Delta {t}_{k} - {\mathbf{v}}_{{b}_{k}}^{w}}\right) \\ {\mathbf{q}}_{{b}_{k}}^{{w}^{-1}} \otimes {\mathbf{q}}_{{b}_{k + 1}}^{w} \\ {\mathbf{b}}_{a{b}_{k + 1}} - {\mathbf{b}}_{a{b}_{k}} \\ {\mathbf{b}}_{w{b}_{k + 1}} - {\mathbf{b}}_{w{b}_{k}} \end{matrix}\right\rbrack . \tag{13} \]

V. Estimator Initialization

Monocular tightly-coupled visual-inertial odometry is a highly nonlinear system. Since the scale is not directly observable from a monocular camera, it is hard to directly fuse these two measurements without good initial values. One may assume a stationary initial condition to start the monocular VINS estimator. However, this assumption is inappropriate as initialization under motion is frequently encountered in real-world applications. The situation becomes even more complicated when IMU measurements are corrupted by large bias. In fact, initialization is usually the most fragile step for monocular VINS. A robust initialization procedure is needed to ensure the applicability of the system.

单目紧耦合视觉-惯性里程计是一个高度非线性的系统。由于尺度不能直接从单目相机中观测到,没有好的初始值很难直接融合这两种测量。有人可能会假设一个静止的初始条件来启动单目VINS估计器。然而,在现实世界应用中,初始化时经常遇到运动状态,这种假设是不恰当的。当IMU测量受到大偏差的干扰时,情况变得更加复杂。实际上,初始化通常是单目VINS中最脆弱的步骤。需要一个健壮的初始化过程来确保系统的适用性。

We adopt a loosely-coupled sensor fusion method to get initial values. We find that vision-only SLAM, or Structure from Motion (SfM), has a good property of initialization. In most cases, a visual-only system can bootstrap itself using initial values derived from relative motion methods, such as the eight-point [32] or five-point [33] algorithm, or from homography estimation. By aligning the metric IMU pre-integration with the visual-only SfM results, we can roughly recover scale, gravity, velocity, and even bias. This is sufficient for bootstrapping a nonlinear monocular VINS estimator, as shown in Fig. 4.

我们采用松耦合的传感器融合方法来获取初始值。我们发现仅视觉 SLAM,即运动恢复结构(SfM),具有很好的初始化特性。在大多数情况下,仅视觉系统可以通过相对运动方法(例如八点法[32]或五点法[33],或估计单应矩阵)得到的初始值来实现自举。通过将具有度量尺度的 IMU 预积分与仅视觉 SfM 结果对齐,我们可以大致恢复尺度、重力、速度甚至偏差。这足以启动一个非线性的单目 VINS 估计器,如图 4 所示。

In contrast to [17], which simultaneously estimates gyroscope and accelerometer bias during the initialization phase, we choose to ignore accelerometer bias terms in the initial step. Accelerometer bias is coupled with gravity, and due to the large magnitude of the gravity vector compared to platform dynamics, and the relatively short duration of the initialization phase, these bias terms are difficult to observe. A detailed analysis of accelerometer bias calibration is presented in our previous work [34].

与[17]不同,后者在初始化阶段同时估计陀螺仪和加速度计的偏差,我们选择在初始步骤中忽略加速度计的偏差项。加速度计偏差与重力耦合,由于重力向量的大小与平台动力学相比很大,且初始化阶段相对较短,这些偏差项很难被观测到。我们之前的工作[34]详细分析了加速度计偏差校准。

A. Sliding Window Vision-Only SfM

The initialization procedure starts with a vision-only SfM to estimate a graph of up-to-scale camera poses and feature positions.

初始化过程从仅视觉 SfM 开始,估计一个由相机位姿和特征位置组成、仅具有相对尺度(up-to-scale)的图。

We maintain a sliding window of frames for bounded computational complexity. Firstly, we check feature correspondences between the latest frame and all previous frames. If we can find stable feature tracking (more than 30 tracked features) and sufficient parallax (more than 20 rotation-compensated pixels) between the latest frame and any other frame in the sliding window, we recover the relative rotation and up-to-scale translation between these two frames using the five-point algorithm [33]. Otherwise, we keep the latest frame in the window and wait for new frames. If the five-point algorithm succeeds, we arbitrarily set the scale and triangulate all features observed in these two frames. Based on these triangulated features, a perspective-n-point (PnP) method [35] is performed to estimate the poses of all other frames in the window. Finally, a global full bundle adjustment [36] is applied to minimize the total reprojection error of all feature observations. Since we do not yet have any knowledge about the world frame, we set the first camera frame \({\left( \cdot \right) }^{{c}_{0}}\) as the reference frame for SfM. All frame poses \(\left( {{\overline{\mathbf{p}}}_{{c}_{k}}^{{c}_{0}},{\mathbf{q}}_{{c}_{k}}^{{c}_{0}}}\right)\) and feature positions are represented with respect to \({\left( \cdot \right) }^{{c}_{0}}\). Given roughly measured extrinsic parameters \(\left( {{\mathbf{p}}_{c}^{b},{\mathbf{q}}_{c}^{b}}\right)\) between the camera and the IMU, we can transform poses from the camera frame to the body (IMU) frame:

我们维护一个帧的滑动窗口以限制计算复杂度。

  • 首先,我们检查最新帧与所有先前帧之间的特征对应关系。
  • 如果在最新帧和滑动窗口中的任何其他帧之间可以找到稳定的特征跟踪(超过30个跟踪特征)和足够的视差(超过20个经旋转补偿的像素),我们使用五点算法 [33] 恢复这两帧之间的相对旋转和未知尺度(up-to-scale)的平移。
  • 否则,我们保留窗口中的最新帧并等待新帧。
  • 如果五点算法成功,我们任意设置尺度并对这两帧中观察到的所有特征进行三角测量。
  • 基于这些三角测量的特征,执行透视-n点(PnP)方法 [35] 来估计窗口中所有其他帧的姿态。
  • 最后,应用全局完整束调整 [36] 以最小化所有特征观测的总重投影误差。
  • 由于我们尚未了解世界坐标系,我们将第一个相机帧 \({\left( \cdot \right) }^{{c}_{0}}\) 设置为SfM的参考帧。所有帧姿态 \(\left( {{\overline{\mathbf{p}}}_{{c}_{k}}^{{c}_{0}},{\mathbf{q}}_{{c}_{k}}^{{c}_{0}}}\right)\) 和特征位置都是相对于 \({\left( \cdot \right) }^{{c}_{0}}\) 表示的。

假设我们有粗略测量的相机与 IMU 之间的外参 \(\left( {{\mathbf{p}}_{c}^{b},{\mathbf{q}}_{c}^{b}}\right)\),我们可以将位姿从相机坐标系转换到机体(IMU)坐标系:

\[{\mathbf{q}}_{{b}_{k}}^{{c}_{0}} = {\mathbf{q}}_{{c}_{k}}^{{c}_{0}} \otimes {\left( {\mathbf{q}}_{c}^{b}\right) }^{-1} \tag{14} \]

\[s{\overline{\mathbf{p}}}_{{b}_{k}}^{{c}_{0}} = s{\overline{\mathbf{p}}}_{{c}_{k}}^{{c}_{0}} - {\mathbf{R}}_{{b}_{k}}^{{c}_{0}}{\mathbf{p}}_{c}^{b} \]

where \(s\) is the scaling parameter that aligns the visual structure to the metric scale. Solving for this scaling parameter is the key to achieving a successful initialization.

其中 \(s\) 是一个缩放参数,用于将视觉结构与度量尺度对齐。求解这个缩放参数是成功初始化的关键。

B. Visual-Inertial Alignment

  1. Gyroscope Bias Calibration: Consider two consecutive frames \({b}_{k}\) and \({b}_{k + 1}\) in the window. We obtain the rotations \({\mathbf{q}}_{{b}_{k}}^{{c}_{0}}\) and \({\mathbf{q}}_{{b}_{k + 1}}^{{c}_{0}}\) from the visual SfM, as well as the relative constraint \({\widehat{\gamma }}_{{b}_{k + 1}}^{{b}_{k}}\) from IMU pre-integration. We linearize the IMU pre-integration term with respect to the gyroscope bias and minimize the following cost function:

陀螺仪偏差校准:考虑窗口中的两个连续帧 \({b}_{k}\)\({b}_{k + 1}\),我们从视觉SfM获得旋转 \({\mathbf{q}}_{{b}_{k}}^{{c}_{0}}\)\({\mathbf{q}}_{{b}_{k + 1}}^{{c}_{0}}\),以及来自IMU预积分的相对约束 \({\widehat{\gamma }}_{{b}_{k + 1}}^{{b}_{k}}\)。我们相对于陀螺仪偏差线性化IMU预积分项,并最小化以下代价函数:

\[\mathop{\min }\limits_{{\delta {b}_{w}}}\mathop{\sum }\limits_{{k \in \mathcal{B}}}{\begin{Vmatrix}{\mathbf{q}}_{{b}_{k + 1}}^{{c}_{0}}{}^{-1} \otimes {\mathbf{q}}_{{b}_{k}}^{{c}_{0}} \otimes {\mathbf{\gamma }}_{{b}_{k + 1}}^{{b}_{k}}\end{Vmatrix}}^{2} \tag{15} \]

\[{\mathbf{\gamma }}_{{b}_{k + 1}}^{{b}_{k}} \approx {\widehat{\mathbf{\gamma }}}_{{b}_{k + 1}}^{{b}_{k}} \otimes \left\lbrack \begin{matrix} 1 \\ \frac{1}{2}{\mathbf{J}}_{{b}_{w}}^{\gamma }\delta {\mathbf{b}}_{w} \end{matrix}\right\rbrack , \]

where \(\mathcal{B}\) indexes all frames in the window. We have the first-order approximation of \({\widehat{\gamma }}_{{b}_{k + 1}}^{{b}_{k}}\) with respect to the gyroscope bias using the bias Jacobian derived in Sect. IV-B. In this way, we get an initial calibration of the gyroscope bias \({\mathbf{b}}_{w}\). Then we re-propagate all IMU pre-integration terms \({\widehat{\mathbf{\alpha }}}_{{b}_{k + 1}}^{{b}_{k}},{\widehat{\mathbf{\beta }}}_{{b}_{k + 1}}^{{b}_{k}}\), and \({\widehat{\mathbf{\gamma }}}_{{b}_{k + 1}}^{{b}_{k}}\) using the new gyroscope bias estimate.

其中 \(\mathcal{B}\) 索引窗口中的所有帧。利用第 IV-B 节中推导的偏差雅可比矩阵,我们得到 \({\widehat{\gamma }}_{{b}_{k + 1}}^{{b}_{k}}\) 关于陀螺仪偏差的一阶近似。这样,我们得到了陀螺仪偏差 \({\mathbf{b}}_{w}\) 的初始校准。然后我们使用新的陀螺仪偏差重新传播所有 IMU 预积分项 \({\widehat{\mathbf{\alpha }}}_{{b}_{k + 1}}^{{b}_{k}},{\widehat{\mathbf{\beta }}}_{{b}_{k + 1}}^{{b}_{k}}\)\({\widehat{\mathbf{\gamma }}}_{{b}_{k + 1}}^{{b}_{k}}\)

  2. Velocity, Gravity Vector, and Metric Scale Initialization: After the gyroscope bias is initialized, we move on to initialize

速度、重力向量和尺度初始化:在陀螺仪偏差初始化之后,我们继续初始化

Fig. 5. Illustration of 2 DOF parameterization of gravity. Since the magnitude of gravity is known, \(\mathbf{g}\) lies on a sphere with radius \(g \approx {9.81}\mathrm{\;m}/{\mathrm{s}}^{2}\) . The gravity is parameterized around current estimate as \(g \cdot \widehat{\mathbf{g}} + {w}_{1}{\mathbf{b}}_{1} + {w}_{2}{\mathbf{b}}_{2}\) , where \({\mathbf{b}}_{1}\) and \({\mathbf{b}}_{2}\) are two orthogonal basis spanning the tangent space.

图 5. 重力 2 自由度参数化的说明。由于已知重力的幅度,\(\mathbf{g}\) 位于半径为 \(g \approx {9.81}\mathrm{\;m}/{\mathrm{s}}^{2}\) 的球面上。重力围绕当前估计值进行参数化,表示为 \(g \cdot \widehat{\mathbf{g}} + {w}_{1}{\mathbf{b}}_{1} + {w}_{2}{\mathbf{b}}_{2}\) ,其中 \({\mathbf{b}}_{1}\)\({\mathbf{b}}_{2}\) 是两个正交基,跨越切空间。

other essential states for navigation, namely velocity, gravity vector, and metric scale:

导航的其他必要状态,即速度、重力向量和度量尺度:

\[{\mathcal{X}}_{I} = \left\lbrack {{\mathbf{v}}_{{b}_{0}}^{{b}_{0}},{\mathbf{v}}_{{b}_{1}}^{{b}_{1}},\cdots {\mathbf{v}}_{{b}_{n}}^{{b}_{n}},{\mathbf{g}}^{{c}_{0}},s}\right\rbrack \tag{16} \]

where \({\mathbf{v}}_{{b}_{k}}^{{b}_{k}}\) is velocity in the body frame while taking the \({k}^{th}\) image, \({\mathbf{g}}^{{c}_{0}}\) is the gravity vector in the \({c}_{0}\) frame,and \(s\) scales the monocular SfM to metric units.

其中 \({\mathbf{v}}_{{b}_{k}}^{{b}_{k}}\) 是拍摄第 \(k\) 帧图像时机体坐标系中的速度,\({\mathbf{g}}^{{c}_{0}}\)\({c}_{0}\) 坐标系中的重力向量,\(s\) 将单目 SfM 缩放到度量单位。

Consider two consecutive frames \({b}_{k}\) and \({b}_{k + 1}\) in the window, then (5) can be written as:

考虑窗口中的两个连续帧 \({b}_{k}\)\({b}_{k + 1}\) ,则(5)可以写成:

\[{\mathbf{\alpha }}_{{b}_{k + 1}}^{{b}_{k}} = {\mathbf{R}}_{{c}_{0}}^{{b}_{k}}\left( {s\left( {{\overline{\mathbf{p}}}_{{b}_{k + 1}}^{{c}_{0}} - {\overline{\mathbf{p}}}_{{b}_{k}}^{{c}_{0}}}\right) + \frac{1}{2}{\mathbf{g}}^{{c}_{0}}\Delta {t}_{k}^{2} - {\mathbf{R}}_{{b}_{k}}^{{c}_{0}}{\mathbf{v}}_{{b}_{k}}^{{b}_{k}}\Delta {t}_{k}}\right) \]

\[{\mathbf{\beta }}_{{b}_{k + 1}}^{{b}_{k}} = {\mathbf{R}}_{{c}_{0}}^{{b}_{k}}\left( {{\mathbf{R}}_{{b}_{k + 1}}^{{c}_{0}}{\mathbf{v}}_{{b}_{k + 1}}^{{b}_{k + 1}} + {\mathbf{g}}^{{c}_{0}}\Delta {t}_{k} - {\mathbf{R}}_{{b}_{k}}^{{c}_{0}}{\mathbf{v}}_{{b}_{k}}^{{b}_{k}}}\right) . \tag{17} \]

We can combine (14) and (17) into the following linear measurement model:

我们可以将 (14) 和 (17) 合并为以下线性测量模型:

\[{\widehat{\mathbf{z}}}_{{b}_{k + 1}}^{{b}_{k}} = \left\lbrack \begin{matrix} {\widehat{\mathbf{\alpha }}}_{{b}_{k + 1}}^{{b}_{k}} - {\mathbf{p}}_{c}^{b} + {\mathbf{R}}_{{c}_{0}}^{{b}_{k}}{\mathbf{R}}_{{b}_{k + 1}}^{{c}_{0}}{\mathbf{p}}_{c}^{b} \\ {\widehat{\mathbf{\beta }}}_{{b}_{k + 1}}^{{b}_{k}} \end{matrix}\right\rbrack = {\mathbf{H}}_{{b}_{k + 1}}^{{b}_{k}}{\mathcal{X}}_{I} + {\mathbf{n}}_{{b}_{k + 1}}^{{b}_{k}} \tag{18} \]

where,

其中,

\[{\mathbf{H}}_{{b}_{k + 1}}^{{b}_{k}} = \left\lbrack \begin{matrix} - \mathbf{I}\Delta {t}_{k} & \mathbf{0} & \frac{1}{2}{\mathbf{R}}_{{c}_{0}}^{{b}_{k}}\Delta {t}_{k}^{2} & {\mathbf{R}}_{{c}_{0}}^{{b}_{k}}\left( {{\overline{\mathbf{p}}}_{{c}_{k + 1}}^{{c}_{0}} - {\overline{\mathbf{p}}}_{{c}_{k}}^{{c}_{0}}}\right) \\ - \mathbf{I} & {\mathbf{R}}_{{c}_{0}}^{{b}_{k}}{\mathbf{R}}_{{b}_{k + 1}}^{{c}_{0}} & {\mathbf{R}}_{{c}_{0}}^{{b}_{k}}\Delta {t}_{k} & \mathbf{0} \end{matrix}\right\rbrack \tag{19} \]

It can be seen that \({\mathbf{R}}_{{b}_{k}}^{{c}_{0}},{\mathbf{R}}_{{b}_{k + 1}}^{{c}_{0}},{\overline{\mathbf{p}}}_{{c}_{k}}^{{c}_{0}},{\overline{\mathbf{p}}}_{{c}_{k + 1}}^{{c}_{0}}\) are obtained from the up-to-scale monocular visual SfM. \(\Delta {t}_{k}\) is the time interval between two consecutive frames. By solving the following linear least-squares problem:

可以看到 \({\mathbf{R}}_{{b}_{k}}^{{c}_{0}},{\mathbf{R}}_{{b}_{k + 1}}^{{c}_{0}},{\overline{\mathbf{p}}}_{{c}_{k}}^{{c}_{0}},{\overline{\mathbf{p}}}_{{c}_{k + 1}}^{{c}_{0}}\) 来自未知尺度(up-to-scale)的单目视觉 SfM。\(\Delta {t}_{k}\) 是两个连续帧之间的时间间隔。通过求解以下线性最小二乘问题:

\[\mathop{\min }\limits_{{\mathcal{X}}_{I}}\mathop{\sum }\limits_{{k \in \mathcal{B}}}{\begin{Vmatrix}{\widehat{\mathbf{z}}}_{{b}_{k + 1}}^{{b}_{k}} - {\mathbf{H}}_{{b}_{k + 1}}^{{b}_{k}}{\mathcal{X}}_{I}\end{Vmatrix}}^{2} \tag{20} \]

we can get body frame velocities for every frame in the window,the gravity vector in the visual reference frame \({\left( \cdot \right) }^{{c}_{0}}\) , as well as the scale parameter.

我们可以得到窗口中每一帧的机体坐标系速度,视觉参考坐标系中的重力向量 \({\left( \cdot \right) }^{{c}_{0}}\) ,以及尺度参数。
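The alignment (18)–(20) is an ordinary linear least-squares problem over \({\mathcal{X}}_{I} = [\mathbf{v}_{b_0}, \ldots, \mathbf{v}_{b_n}, \mathbf{g}^{c_0}, s]\). A minimal numpy sketch that stacks the per-pair blocks of (19) and solves (20) (the input layout is our assumption; for brevity the \({\mathbf{p}}_{c}^{b}\) terms of (18) are taken as already folded into the given \(\widehat{\mathbf{\alpha}}\) measurement):

```python
import numpy as np

def solve_alignment(frames):
    """Stack the per-pair linear models (18)-(19) and solve (20) for
    X_I = [v_0 ... v_n, g^{c0}, s] by linear least squares.
    Each entry of `frames` describes one consecutive pair k:
      R_c0_bk  : rotation c0 -> b_k        R_bk1_c0 : rotation b_{k+1} -> c0
      p_ck, p_ck1 : up-to-scale SfM positions      dt : frame interval
      alpha, beta : pre-integration measurements (extrinsic terms folded in)."""
    n_pairs = len(frames)
    n_states = 3 * (n_pairs + 1) + 3 + 1
    H = np.zeros((6 * n_pairs, n_states))
    z = np.zeros(6 * n_pairs)
    gi, si = n_states - 4, n_states - 1      # gravity cols, scale col
    for k, f in enumerate(frames):
        R, row = f["R_c0_bk"], 6 * k
        vk, vk1 = 3 * k, 3 * (k + 1)
        # alpha rows of (19)
        H[row:row+3, vk:vk+3] = -np.eye(3) * f["dt"]
        H[row:row+3, gi:gi+3] = 0.5 * R * f["dt"]**2
        H[row:row+3, si] = R @ (f["p_ck1"] - f["p_ck"])
        # beta rows of (19)
        H[row+3:row+6, vk:vk+3] = -np.eye(3)
        H[row+3:row+6, vk1:vk1+3] = R @ f["R_bk1_c0"]
        H[row+3:row+6, gi:gi+3] = R * f["dt"]
        z[row:row+3] = f["alpha"]
        z[row+3:row+6] = f["beta"]
    x, *_ = np.linalg.lstsq(H, z, rcond=None)
    return x[:-4].reshape(-1, 3), x[-4:-1], x[-1]   # velocities, g^{c0}, s
```

On noise-free synthetic data generated from (17) the solver recovers the body-frame velocities, \(\mathbf{g}^{c_0}\), and the scale \(s\) exactly, which is a useful sanity check before feeding real pre-integration terms.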

  3. Gravity Refinement: The gravity vector obtained from the previous linear initialization step can be refined by constraining its magnitude. In most cases, the magnitude of the gravity vector is known. This leaves only 2 DOF for the gravity vector. We therefore re-parameterize the gravity with two variables on its tangent space. Our parameterization represents the gravity vector as \(g \cdot \widehat{\mathbf{g}} + {w}_{1}{\mathbf{b}}_{1} + {w}_{2}{\mathbf{b}}_{2}\), where \(g\) is the known magnitude of the gravity and \(\widehat{\mathbf{g}}\) is a unit vector representing the gravity direction. \({\mathbf{b}}_{1}\) and \({\mathbf{b}}_{2}\) are two orthogonal bases spanning the tangent plane, as shown in Fig. 5. \({w}_{1}\) and \({w}_{2}\) are the corresponding displacements along \({\mathbf{b}}_{1}\) and \({\mathbf{b}}_{2}\), respectively. We can find one set of \({\mathbf{b}}_{1},{\mathbf{b}}_{2}\) by cross-product operations using Algorithm 1. Then we substitute \(\mathbf{g}\) in (17) with \(g \cdot \widehat{\mathbf{g}} + {w}_{1}{\mathbf{b}}_{1} + {w}_{2}{\mathbf{b}}_{2}\), and solve for \({w}_{1}\) and \({w}_{2}\) together with the other state variables. This process iterates until \(\widehat{\mathbf{g}}\) converges.

重力细化:通过之前线性初始化步骤获得的重力向量,可以通过约束其模长来进一步细化。在大多数情况下,重力向量的模长是已知的。这使得重力向量只剩下 2 个自由度。因此,我们用其切空间上的两个变量重新参数化重力。我们的参数化将重力向量表示为 \(g \cdot \widehat{\mathbf{g}} + {w}_{1}{\mathbf{b}}_{1} + {w}_{2}{\mathbf{b}}_{2}\),其中 \(g\) 是已知的重力模长,\(\widehat{\mathbf{g}}\) 是表示重力方向的单位向量。\({\mathbf{b}}_{1}\)\({\mathbf{b}}_{2}\) 是张成切平面的两个正交基,如图 5 所示。\({w}_{1}\)\({w}_{2}\) 分别是沿 \({\mathbf{b}}_{1}\)\({\mathbf{b}}_{2}\) 方向的位移。我们可以使用算法 1 通过叉积运算找到一组 \({\mathbf{b}}_{1},{\mathbf{b}}_{2}\)。然后我们将 (17) 中的 \(\mathbf{g}\) 替换为 \(g \cdot \widehat{\mathbf{g}} + {w}_{1}{\mathbf{b}}_{1} + {w}_{2}{\mathbf{b}}_{2}\),并与其他状态变量一起求解 \({w}_{1}\)\({w}_{2}\)。此过程迭代进行,直到 \(\widehat{\mathbf{g}}\) 收敛。


Algorithm 1: Finding \({\mathbf{b}}_{1}\) and \({\mathbf{b}}_{2}\)

if \(\widehat{\mathbf{g}} \neq \left\lbrack {1,0,0}\right\rbrack\) then

\({\mathbf{b}}_{1} \leftarrow \operatorname{normalize}\left( {\widehat{\mathbf{g}} \times \left\lbrack {1,0,0}\right\rbrack }\right)\) ;

else

\({\mathbf{b}}_{1} \leftarrow \operatorname{normalize}\left( {\widehat{\mathbf{g}} \times \left\lbrack {0,0,1}\right\rbrack }\right)\) ;

end

\({\mathbf{b}}_{2} \leftarrow \widehat{\mathbf{g}} \times {\mathbf{b}}_{1}\)
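Algorithm 1 translates directly into a few lines of numpy (a minimal sketch under our own naming):

```python
import numpy as np

def tangent_basis(g_hat):
    """Algorithm 1: two orthonormal vectors b1, b2 spanning the tangent
    plane orthogonal to the unit gravity direction g_hat."""
    e = np.array([1.0, 0.0, 0.0])
    if np.allclose(g_hat, e):          # degenerate case: pick another axis
        e = np.array([0.0, 0.0, 1.0])
    b1 = np.cross(g_hat, e)
    b1 /= np.linalg.norm(b1)
    b2 = np.cross(g_hat, b1)           # already unit length: g_hat ⟂ b1
    return b1, b2
```

The same helper is reusable for the tangent-plane projection of the visual residual in Sect. VI-C, since both only need an orthonormal basis of the plane normal to a unit vector.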


  4. Completing Initialization: After refining the gravity vector, we can get the rotation \({\mathbf{q}}_{{c}_{0}}^{w}\) between the world frame and the camera frame \({c}_{0}\) by rotating the gravity to the z-axis. We then rotate all variables from the reference frame \({\left( \cdot \right) }^{{c}_{0}}\) to the world frame \({\left( \cdot \right) }^{w}\). The body frame velocities are also rotated into the world frame. Translational components from the visual SfM are scaled to metric units. At this point, the initialization procedure is complete, and all these metric values are fed to a tightly-coupled monocular VIO.

完成初始化:细化重力向量后,我们可以通过将重力旋转到z轴来获得世界坐标系与相机坐标系之间的旋转 \({\mathbf{q}}_{{c}_{0}}^{w}\)。然后我们将所有变量从参考坐标系 \({\left( \cdot \right) }^{{c}_{0}}\) 旋转到世界坐标系 \({\left( \cdot \right) }^{w}\)。机体坐标系的速度也将被旋转到世界坐标系。来自视觉SfM的平移分量将被缩放到度量单位。至此,初始化过程完成,所有这些度量值将用于紧耦合的单目VIO。

VI. 紧耦合的单目VIO

After estimator initialization, we proceed with a sliding window-based tightly-coupled monocular VIO for high-accuracy and robust state estimation. An illustration of the sliding window formulation is shown in Fig. 3.

在估计器初始化之后,我们继续进行基于滑动窗口的紧耦合单目VIO,以实现高精度和稳健的状态估计。滑动窗口公式的示意图如图3所示。

A. Formulation 公式化

The full state vector in the sliding window is defined as:

滑动窗口中的完整状态向量定义为:

\[\mathcal{X} = \left\lbrack {{\mathbf{x}}_{0},{\mathbf{x}}_{1},\cdots {\mathbf{x}}_{n},{\mathbf{x}}_{c}^{b},{\lambda }_{0},{\lambda }_{1},\cdots {\lambda }_{m}}\right\rbrack \]

\[{\mathbf{x}}_{k} = \left\lbrack {{\mathbf{p}}_{{b}_{k}}^{w},{\mathbf{v}}_{{b}_{k}}^{w},{\mathbf{q}}_{{b}_{k}}^{w},{\mathbf{b}}_{a},{\mathbf{b}}_{g}}\right\rbrack ,k \in \left\lbrack {0,n}\right\rbrack \tag{21} \]

\[{\mathbf{x}}_{c}^{b} = \left\lbrack {{\mathbf{p}}_{c}^{b},{\mathbf{q}}_{c}^{b}}\right\rbrack \]

where \({\mathbf{x}}_{k}\) is the IMU state at the time that the \({k}^{th}\) image is captured. It contains position, velocity, and orientation of the IMU in the world frame, and acceleration bias and gyroscope bias in the IMU body frame. \(n\) is the total number of keyframes,and \(m\) is the total number of features in the sliding window. \({\lambda }_{l}\) is the inverse depth of the \({l}^{th}\) feature from its first observation.

其中 \({\mathbf{x}}_{k}\) 是在捕获 \({k}^{th}\) 图像时 IMU 的状态。它包含 IMU 在世界坐标系中的位置、速度和方向,以及 IMU 机体坐标系中的加速度偏差和陀螺仪偏差。\(n\) 是关键帧的总数,\(m\) 是滑动窗口中的特征总数。\({\lambda }_{l}\) 是从第一次观测到的 \({l}^{th}\) 特征的逆深度。

We use a visual-inertial bundle adjustment formulation. We minimize the sum of the prior and the Mahalanobis norms of all measurement residuals to obtain a maximum a posteriori estimation:

我们使用视觉-惯性 bundle adjustment 公式。我们最小化先验信息和所有测量残差的马哈拉诺比斯范数的和,以获得最大后验估计:

\[\mathop{\min }\limits_{\mathcal{X}}\left\{ {{\begin{Vmatrix}{\mathbf{r}}_{p} - {\mathbf{H}}_{p}\mathcal{X}\end{Vmatrix}}^{2} + \mathop{\sum }\limits_{{k \in \mathcal{B}}}{\begin{Vmatrix}{\mathbf{r}}_{\mathcal{B}}\left( {\widehat{\mathbf{z}}}_{{b}_{k + 1}}^{{b}_{k}},\mathcal{X}\right) \end{Vmatrix}}_{{\mathbf{P}}_{{b}_{k + 1}}^{{b}_{k}}}^{2} + }\right. \]

\[\left. {\mathop{\sum }\limits_{{\left( {l,j}\right) \in \mathcal{C}}}\rho \left( {\begin{Vmatrix}{\mathbf{r}}_{\mathcal{C}}\left( {\widehat{\mathbf{z}}}_{l}^{{c}_{j}},\mathcal{X}\right) \end{Vmatrix}}_{{\mathbf{P}}_{l}^{{c}_{j}}}^{2}\right) }\right\} , \tag{22} \]

where the Huber norm [37] is defined as:

其中 Huber 范数 [37] 定义为:

\[\rho \left( s\right) = \left\{ \begin{array}{ll} s & s \leq 1 \\ 2\sqrt{s} - 1 & s > 1 \end{array}\right. \tag{23} \]
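The Huber norm (23) is applied to the squared visual residual: it is the identity for small residuals and grows only with the square root beyond the threshold, which bounds the influence of outlier feature matches. A minimal sketch:

```python
import math

def huber_rho(s):
    """Huber norm (23) applied to a squared residual s = ||r||^2:
    quadratic region (identity on s) for s <= 1, linear growth beyond."""
    return s if s <= 1.0 else 2.0 * math.sqrt(s) - 1.0
```

Note that \(\rho\) is continuous at \(s = 1\) (both branches give 1), so the cost surface stays smooth enough for the nonlinear solver.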

\({\mathbf{r}}_{\mathcal{B}}\left( {{\widehat{\mathbf{z}}}_{{b}_{k + 1}}^{{b}_{k}},\mathcal{X}}\right)\) and \({\mathbf{r}}_{\mathcal{C}}\left( {{\widehat{\mathbf{z}}}_{l}^{{c}_{j}},\mathcal{X}}\right)\) are residuals for IMU and visual measurements respectively. Detailed definition of the residual terms will be presented in Sect. VI-B and Sect. VI-C. \(\mathcal{B}\) is the set of all IMU measurements, \(\mathcal{C}\) is the set of features which have been observed at least twice in the current sliding window. \(\left\{ {{\mathbf{r}}_{p},{\mathbf{H}}_{p}}\right\}\) is the prior information from marginalization. Ceres Solver [38] is used for solving this nonlinear problem.

\({\mathbf{r}}_{\mathcal{B}}\left( {{\widehat{\mathbf{z}}}_{{b}_{k + 1}}^{{b}_{k}},\mathcal{X}}\right)\)\({\mathbf{r}}_{\mathcal{C}}\left( {{\widehat{\mathbf{z}}}_{l}^{{c}_{j}},\mathcal{X}}\right)\) 分别是 IMU 和视觉测量的残差。残差项的详细定义将在第 VI-B 节和第 VI-C 节中介绍。\(\mathcal{B}\) 是所有 IMU 测量的集合,\(\mathcal{C}\) 是在当前滑动窗口中至少被观测两次的特征集合。\(\left\{ {{\mathbf{r}}_{p},{\mathbf{H}}_{p}}\right\}\) 是来自边缘化的先验信息。使用 Ceres Solver [38] 解决这个非线性问题。

B. IMU Measurement Residual IMU 测量残差

Consider the IMU measurements within two consecutive frames \({b}_{k}\) and \({b}_{k + 1}\) in the sliding window. According to the IMU measurement model defined in (13), the residual for the pre-integrated IMU measurement can be defined as:

考虑滑动窗口中两个连续帧 \({b}_{k}\)\({b}_{k + 1}\) 内的 IMU 测量,根据 (13) 中定义的 IMU 测量模型,预积分 IMU 测量的残差可以定义为:

\[{\mathbf{r}}_{\mathcal{B}}\left( {{\widehat{\mathbf{z}}}_{{b}_{k + 1}}^{{b}_{k}},\mathcal{X}}\right) = \left\lbrack \begin{matrix} \delta {\mathbf{\alpha }}_{{b}_{k + 1}}^{{b}_{k}} \\ \delta {\mathbf{\beta }}_{{b}_{k + 1}}^{{b}_{k}} \\ \delta {\mathbf{\theta }}_{{b}_{k + 1}}^{{b}_{k}} \\ \delta {\mathbf{b}}_{a} \\ \delta {\mathbf{b}}_{g} \end{matrix}\right\rbrack \]

\[= \left\lbrack \begin{matrix} {\mathbf{R}}_{w}^{{b}_{k}}\left( {{\mathbf{p}}_{{b}_{k + 1}}^{w} - {\mathbf{p}}_{{b}_{k}}^{w} + \frac{1}{2}{\mathbf{g}}^{w}\Delta {t}_{k}^{2} - {\mathbf{v}}_{{b}_{k}}^{w}\Delta {t}_{k}}\right) - {\widehat{\mathbf{\alpha }}}_{{b}_{k + 1}}^{{b}_{k}} \\ {\mathbf{R}}_{w}^{{b}_{k}}\left( {{\mathbf{v}}_{{b}_{k + 1}}^{w} + {\mathbf{g}}^{w}\Delta {t}_{k} - {\mathbf{v}}_{{b}_{k}}^{w}}\right) - {\widehat{\mathbf{\beta }}}_{{b}_{k + 1}}^{{b}_{k}} \\ 2{\left\lbrack {\mathbf{q}}_{{b}_{k}}^{{w}^{-1}} \otimes {\mathbf{q}}_{{b}_{k + 1}}^{w} \otimes {\left( {\widehat{\mathbf{\gamma }}}_{{b}_{k + 1}}^{{b}_{k}}\right) }^{-1}\right\rbrack }_{xyz} \\ {\mathbf{b}}_{a{b}_{k + 1}} - {\mathbf{b}}_{a{b}_{k}} \\ {\mathbf{b}}_{w{b}_{k + 1}} - {\mathbf{b}}_{w{b}_{k}} \end{matrix}\right\rbrack , \tag{24} \]

where \({\left\lbrack \cdot \right\rbrack }_{xyz}\) extracts the vector part of a quaternion \(\mathbf{q}\) for error state representation. \(\delta {\mathbf{\theta }}_{{b}_{k + 1}}^{{b}_{k}}\) is the three dimensional error-state representation of quaternion. \({\left\lbrack {\widehat{\alpha }}_{{b}_{k + 1}}^{{b}_{k}},{\widehat{\mathbf{\beta }}}_{{b}_{k + 1}}^{{b}_{k}},{\widehat{\mathbf{\gamma }}}_{{b}_{k + 1}}^{{b}_{k}}\right\rbrack }^{\mathrm{T}}\) are pre-integrated IMU measurement terms using only noisy accelerometer and gyroscope measurements within the time interval between two consecutive image frames. Accelerometer and gyroscope biases are also included in the residual terms for online correction.

其中 \({\left\lbrack \cdot \right\rbrack }_{xyz}\) 提取四元数 \(\mathbf{q}\) 的向量部分用于误差状态表示。\(\delta {\mathbf{\theta }}_{{b}_{k + 1}}^{{b}_{k}}\) 是四元数的三维误差状态表示。\({\left\lbrack {\widehat{\alpha }}_{{b}_{k + 1}}^{{b}_{k}},{\widehat{\mathbf{\beta }}}_{{b}_{k + 1}}^{{b}_{k}},{\widehat{\mathbf{\gamma }}}_{{b}_{k + 1}}^{{b}_{k}}\right\rbrack }^{\mathrm{T}}\) 是仅使用噪声加速度计和陀螺仪测量值在连续两帧图像之间的时间间隔内预积分的IMU测量项。加速度计和陀螺仪的偏差也包含在残差项中,用于在线校正。

C. Visual Measurement Residual

In contrast to traditional pinhole camera models that define reprojection errors on a generalized image plane, we define the camera measurement residual on a unit sphere. The optics of almost all types of cameras, including wide-angle, fisheye, or omnidirectional cameras, can be modeled as unit rays connecting to the surface of a unit sphere. Consider the \({l}^{th}\) feature that is first observed in the \({i}^{\text{th }}\) image; the residual for the feature observation in the \({j}^{\text{th }}\) image is defined as:

与在通用图像平面上定义重投影误差的传统针孔相机模型不同,我们在单位球上定义相机测量残差。几乎所有类型相机(包括广角、鱼眼和全向相机)的光学系统都可以建模为连接到单位球表面的单位射线。考虑在第 \({i}^{th}\) 帧图像中首次被观测到的第 \({l}^{th}\) 个特征,它在第 \({j}^{th}\) 帧图像中的观测残差定义为:

\[
\begin{aligned}
{\mathbf{r}}_{\mathcal{C}}\left( {{\widehat{\mathbf{z}}}_{l}^{{c}_{j}},\mathcal{X}}\right) &= {\left\lbrack \begin{array}{ll} {\mathbf{b}}_{1} & {\mathbf{b}}_{2} \end{array}\right\rbrack }^{T} \cdot \left( {{\widehat{\bar{\mathcal{P}}}}_{l}^{{c}_{j}} - \frac{{\mathcal{P}}_{l}^{{c}_{j}}}{\begin{Vmatrix}{\mathcal{P}}_{l}^{{c}_{j}}\end{Vmatrix}}}\right) \\
{\widehat{\bar{\mathcal{P}}}}_{l}^{{c}_{j}} &= {\pi }_{c}^{-1}\left( \left\lbrack \begin{array}{l} {\widehat{u}}_{l}^{{c}_{j}} \\ {\widehat{v}}_{l}^{{c}_{j}} \end{array}\right\rbrack \right) \\
{\mathcal{P}}_{l}^{{c}_{j}} &= {\mathbf{R}}_{b}^{c}\left( {\mathbf{R}}_{w}^{{b}_{j}}\left( {\mathbf{R}}_{{b}_{i}}^{w}\left( {\mathbf{R}}_{c}^{b}\frac{1}{{\lambda }_{l}}{\pi }_{c}^{-1}\left( \left\lbrack \begin{array}{l} {u}_{l}^{{c}_{i}} \\ {v}_{l}^{{c}_{i}} \end{array}\right\rbrack \right) + {\mathbf{p}}_{c}^{b}\right) + {\mathbf{p}}_{{b}_{i}}^{w} - {\mathbf{p}}_{{b}_{j}}^{w}\right) - {\mathbf{p}}_{c}^{b}\right)
\end{aligned} \tag{25}
\]

where \(\left\lbrack {{u}_{l}^{{c}_{i}},{v}_{l}^{{c}_{i}}}\right\rbrack\) is the first observation of the \({l}^{th}\) feature, which happens in the \({i}^{th}\) image. \(\left\lbrack {{\widehat{u}}_{l}^{{c}_{j}},{\widehat{v}}_{l}^{{c}_{j}}}\right\rbrack\) is the observation of the same feature in the \({j}^{th}\) image. \({\pi }_{c}^{-1}\) is the back-projection function, which turns a pixel location into a unit vector using camera intrinsic parameters. Since the vision residual has two degrees of freedom, we project the residual vector onto the tangent plane. \({\mathbf{b}}_{1},{\mathbf{b}}_{2}\) are two arbitrarily selected orthogonal bases that span the tangent plane of \({\widehat{\mathcal{P}}}_{l}^{{c}_{j}}\), as shown in Fig. 6. We can find one set of \({\mathbf{b}}_{1},{\mathbf{b}}_{2}\) easily, as shown in Algorithm 1. \({\mathbf{P}}_{l}^{{c}_{j}}\), as used in (22), is the standard covariance of a fixed length in the tangent space.

其中 \(\left\lbrack {{u}_{l}^{{c}_{i}},{v}_{l}^{{c}_{i}}}\right\rbrack\) 是第 \({l}^{th}\) 个特征在第 \({i}^{th}\) 帧图像中的首次观测。\(\left\lbrack {{\widehat{u}}_{l}^{{c}_{j}},{\widehat{v}}_{l}^{{c}_{j}}}\right\rbrack\) 是同一特征在第 \({j}^{th}\) 帧图像中的观测。\({\pi }_{c}^{-1}\) 是反投影函数,它使用相机内参将像素位置转换为单位向量。由于视觉残差的自由度为二,我们将残差向量投影到切平面上。\({\mathbf{b}}_{1},{\mathbf{b}}_{2}\) 是任意选择的两个正交基,它们张成 \({\widehat{\mathcal{P}}}_{l}^{{c}_{j}}\) 的切平面,如图6所示。我们可以容易地找到一组 \({\mathbf{b}}_{1},{\mathbf{b}}_{2}\),如算法1所示。\({\mathbf{P}}_{l}^{{c}_{j}}\)(见(22))是切空间中固定长度的标准协方差。
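Since Algorithm 1 itself is not reproduced in this excerpt, here is a hedged sketch of one common way to construct such a tangent-plane basis \({\mathbf{b}}_{1},{\mathbf{b}}_{2}\) for a unit vector; the helper names are illustrative, not the paper's:

```python
def cross(a, b):
    return [a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]]

def dot(a, b):
    return sum(x*y for x, y in zip(a, b))

def normalize(v):
    n = sum(x*x for x in v) ** 0.5
    return [x / n for x in v]

def tangent_basis(p_hat):
    """Return two orthonormal vectors spanning the plane orthogonal to p_hat."""
    # Pick any axis not parallel to p_hat, then orthogonalize via cross products.
    tmp = [1.0, 0.0, 0.0] if abs(p_hat[0]) < 0.9 else [0.0, 1.0, 0.0]
    b1 = normalize(cross(p_hat, tmp))
    b2 = cross(p_hat, b1)
    return b1, b2
```

Any such pair works because the residual only needs an arbitrary orthonormal parameterization of the two-dimensional tangent space.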

Fig. 6. An illustration of the visual residual on a unit sphere. \({\widehat{\mathcal{P}}}_{l}^{{c}_{j}}\) is the unit vector for the observation of the \({l}^{th}\) feature in the \({j}^{th}\) frame. \({\mathcal{P}}_{l}^{{c}_{j}}\) is the predicted feature measurement on the unit sphere, obtained by transforming its first observation in the \({i}^{th}\) frame to the \({j}^{th}\) frame. The residual is defined on the tangent plane of \({\widehat{\mathcal{P}}}_{l}^{{c}_{j}}\).

图 6. 单位球上视觉残差的示意图。\({\widehat{\mathcal{P}}}_{l}^{{c}_{j}}\) 是第 \({l}^{th}\) 个特征在第 \({j}^{th}\) 帧中观测的单位向量。\({\mathcal{P}}_{l}^{{c}_{j}}\) 是将该特征在第 \({i}^{th}\) 帧中的首次观测变换到第 \({j}^{th}\) 帧后,在单位球上预测的特征测量值。残差定义在 \({\widehat{\mathcal{P}}}_{l}^{{c}_{j}}\) 的切平面上。

Fig. 7. An illustration of our marginalization strategy. If the second latest frame is a keyframe, we will keep it in the window, and marginalize the oldest frame and its corresponding visual and inertial measurements. Marginalized measurements are turned into a prior. If the second latest frame is not a keyframe, we will simply remove the frame and all its corresponding visual measurements. However, pre-integrated inertial measurements are kept for non-keyframes, and the pre-integration process is continued towards the next frame.

图 7. 我们边缘化策略的示意图。如果倒数第二帧是关键帧,我们将它保留在窗口中,并边缘化最旧的帧及其对应的视觉和惯性测量值。边缘化的测量值被转换为一个先验值。如果倒数第二帧不是关键帧,我们将只移除该帧及其所有对应的视觉测量值。然而,非关键帧的预积分惯性测量值将被保留,预积分过程继续向下一帧进行。

D. Marginalization

D. 边缘化

In order to bound the computational complexity of our optimization-based VIO, marginalization is incorporated. We selectively marginalize out IMU states \({\mathbf{x}}_{k}\) and features \({\lambda }_{l}\) from the sliding window, meanwhile convert measurements corresponding to marginalized states into a prior.

为了限制基于优化的视觉惯性里程计算法的计算复杂度,我们引入了边缘化。我们选择性地从滑动窗口中边缘化 IMU 状态 \({\mathbf{x}}_{k}\) 和特征 \({\lambda }_{l}\),同时将对应于边缘化状态的测量值转换为先验值。

As shown in Fig. 7, when the second latest frame is a keyframe, it will stay in the window, and the oldest frame is marginalized out with its corresponding measurements. Otherwise, if the second latest frame is a non-keyframe, we throw away its visual measurements and keep the IMU measurements that connect to this non-keyframe. We do not marginalize out all measurements for non-keyframes in order to maintain the sparsity of the system. Our marginalization scheme aims to keep spatially separated keyframes in the window. This ensures sufficient parallax for feature triangulation and maximizes the probability of maintaining accelerometer measurements with large excitation.

  • 如图 7 所示,当倒数第二帧是关键帧时,它将保留在窗口中,最旧的帧及其对应的测量值被边缘化。
  • 否则,如果倒数第二帧是非关键帧,我们将丢弃视觉测量值并保留连接到该非关键帧的 IMU 测量值。
  • 我们不对非关键帧的所有测量值进行边缘化,以保持系统的稀疏性。
  • 我们的边缘化方案旨在在窗口中保持空间分离的关键帧。这确保了足够视差以进行特征三角测量,并最大化保持具有大激励的加速度计测量的概率。

Fig. 8. An illustration of motion-only bundle adjustment for camera-rate outputs.

图 8. 用于相机率输出的仅运动束调整示意图。

The marginalization is carried out using the Schur complement [39]. We construct a new prior based on all marginalized measurements related to the removed state. The new prior is added onto the existing prior.

边缘化是通过使用 舒尔补 [39] 来执行的。我们构建了一个基于所有与移除状态相关的边缘化测量的新先验。这个新先验被添加到现有的先验上。

We do note that marginalization results in the early fix of linearization points, which may result in suboptimal estimation results. However, since small drifting is acceptable for VIO, we argue that the negative impact caused by marginalization is not critical.

我们确实注意到边缘化导致线性化点的早期固定,这可能会导致次优的估计结果。然而,由于对于 VIO 来说小的漂移是可以接受的,我们认为边缘化造成的负面影响并不关键。

在视觉惯性导航系统(VINS)中,边缘化(marginalization)是一种优化技术,用于在处理大量数据时保持系统的稀疏性和计算效率。具体来说,边缘化通过移除旧的或不重要的状态(如关键帧或特征点)来减少优化问题的规模,同时保留这些状态对系统的影响。

"舒尔补"(Schur complement)是一种数学工具,用于在矩阵分解和线性系统求解中简化计算。在边缘化过程中,舒尔补被用来将一个大的矩阵分解成两个较小的矩阵,从而减少计算复杂度。具体步骤如下:

  1. 构建信息矩阵:首先,将所有相关的测量值和状态组合成一个大的信息矩阵。
  2. 选择边缘化变量:确定哪些状态需要被边缘化(即移除)。
  3. 应用 舒尔补:通过 舒尔补,将信息矩阵分解成两个部分:一个包含保留状态的矩阵和一个包含边缘化状态的矩阵。
  4. 计算新先验:基于边缘化状态的矩阵,计算一个新的先验(prior),这个先验代表了被移除状态对系统的影响。
  5. 更新优化问题:将新的先验添加到现有的优化问题中,从而在保留系统稀疏性的同时,考虑被移除状态的影响。
    通过这种方式,边缘化能够在保持系统稀疏性和计算效率的同时,有效地处理大量数据,确保优化结果的准确性和稳定性。
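The five steps above can be sketched on a toy problem. The following is an illustrative numerical example (not VINS-Mono's implementation): a two-state information system in which state 0 is marginalized via the Schur complement, leaving a scalar prior on state 1. The function name and the 2x2 restriction are assumptions made for brevity.

```python
def marginalize_first(H, b):
    """Marginalize state 0 of a 2x2 information system H x = b.

    Returns the scalar prior (H_prior, b_prior) on the remaining state:
    H_prior = H11 - H10 * H00^-1 * H01,  b_prior = b1 - H10 * H00^-1 * b0.
    """
    h00, h01 = H[0]
    h10, h11 = H[1]
    h_prior = h11 - h10 * h01 / h00
    b_prior = b[1] - h10 * b[0] / h00
    return h_prior, b_prior
```

Solving the reduced system `h_prior * x1 = b_prior` gives the same estimate for the kept state as solving the full system, which is exactly the property the marginalization prior is meant to preserve.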

E. Motion-only Visual-Inertial Bundle Adjustment for Camera-Rate State Estimation

E. 仅运动视觉-惯性束调整用于相机率状态估计

For devices with low computational power, such as mobile phones, the tightly-coupled monocular VIO cannot achieve camera-rate outputs due to the heavy computation demands for the nonlinear optimization. To this end, we employ a lightweight motion-only visual-inertial bundle adjustment to boost the state estimation to camera-rate \(\left( {{30}\mathrm{\;{Hz}}}\right)\) .

对于计算能力较低的设备,如手机,紧耦合的单目 VIO 由于非线性优化的计算需求巨大,无法实现相机率输出。为此,我们采用轻量级的仅运动视觉-惯性束调整来提升状态估计至相机率 \(\left( {{30}\mathrm{\;{Hz}}}\right)\)

The cost function for the motion-only visual-inertial bundle adjustment is the same as the one for monocular VIO in (22). However, instead of optimizing all states in the sliding window, we only optimize the poses and velocities of a fixed number of the latest IMU states. We treat feature depth, extrinsic parameters, bias, and old IMU states that we do not want to optimize as constant values. We do use all visual and inertial measurements for the motion-only bundle adjustment. This results in much smoother state estimates than single-frame PnP methods. An illustration of the proposed strategy is shown in Fig. 8. In contrast to the full tightly-coupled monocular VIO, which may take more than \({50}\mathrm{\;{ms}}\) on state-of-the-art embedded computers, the motion-only visual-inertial bundle adjustment only takes about \(5\mathrm{\;{ms}}\) to compute. This enables low-latency camera-rate pose estimation that is particularly beneficial for drone and AR applications.

仅运动视觉-惯性束调整的成本函数与单目 VIO 中的 (22) 相同。然而,我们不是优化滑动窗口中的所有状态,而是仅优化固定数量的最新 IMU 状态的位姿和速度。我们将不希望优化的特征深度、外参、偏差和旧 IMU 状态视为常量值。我们确实使用了所有视觉和惯性测量来进行仅运动束调整。这比单帧 PnP 方法得到的状态估计要平滑得多。图 8 展示了所提出策略的示意图。与可能导致最先进的嵌入式计算机上超过 \({50}\mathrm{\;{ms}}\) 的完全紧耦合单目 VIO 相比,仅运动视觉-惯性束调整的计算时间大约只有 \(5\mathrm{\;{ms}}\)。这使得低延迟的相机率位姿估计成为可能,这对于无人机和增强现实应用特别有益。

F. IMU Forward Propagation for IMU-Rate State Estimation

F. IMU前向传播用于IMU率状态估计

IMU measurements come at a much higher rate than visual measurements. Although the frequency of our VIO is limited by the image capture frequency, we can still directly propagate the latest VIO estimate with the set of most recent IMU measurements to achieve IMU-rate performance. The high-frequency state estimates can be utilized as state feedback for closed-loop control. An autonomous flight experiment utilizing these IMU-rate state estimates is presented in Sect. IX-D.

IMU的测量频率远高于视觉测量。尽管我们的VIO频率受图像捕获频率的限制,我们仍然可以用最近的一组IMU测量值直接传播最新的VIO估计,以实现IMU率性能。高频状态估计可以用作闭环控制的状态反馈。第IX-D节展示了利用这种IMU率状态估计的自主飞行实验。
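The propagation step can be sketched as a simple discrete integration. The following is a minimal illustration under assumed simplifications (not the released code): rotation and bias handling are omitted, and `a_world` is assumed to be already rotated into the world frame and bias-corrected.

```python
GRAVITY = [0.0, 0.0, -9.81]  # world-frame gravity, z-up convention assumed

def propagate(p, v, a_world, dt):
    """One Euler step of IMU forward propagation: returns (position, velocity)."""
    a = [ai + gi for ai, gi in zip(a_world, GRAVITY)]
    p_new = [pi + vi * dt + 0.5 * ai * dt * dt for pi, vi, ai in zip(p, v, a)]
    v_new = [vi + ai * dt for vi, ai in zip(v, a)]
    return p_new, v_new
```

Calling this once per IMU sample on top of the latest VIO estimate yields state outputs at the IMU rate between image frames.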

G. Failure Detection and Recovery

G. 故障检测与恢复

Although our tightly-coupled monocular VIO is robust to various challenging environments and motions, failure is still unavoidable due to violent illumination changes or severely aggressive motions. An active failure detection and recovery strategy can improve the practicability of the proposed system. Failure detection is an independent module that detects unusual output from the estimator. We are currently using the following criteria for failure detection:

尽管我们的紧耦合单目VIO在各种具有挑战性的环境和运动中具有鲁棒性,但由于强烈光照变化或极度剧烈的运动,故障仍然不可避免。主动故障检测与恢复策略可以提高所提系统的实用性。故障检测是一个独立模块,用于检测估计器的异常输出。我们目前使用以下标准进行故障检测:

  • The number of features being tracked in the latest frame is less than a certain threshold;

  • 在最新帧中跟踪的特征数量少于特定阈值;

  • Large discontinuity in position or rotation between the last two estimator outputs;

  • 在最近的两个估计器输出之间位置或旋转存在大的不连续性;

  • Large change in bias or extrinsic parameter estimates.

  • 偏差或外部参数估计发生大的变化;

Once a failure is detected, the system switches back to the initialization phase. Once the monocular VIO is successfully initialized, a new and separate segment of the pose graph will be created.

一旦检测到故障,系统将切换回初始化阶段。一旦单目VIO成功初始化,将创建一个新的独立姿态图段。
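The three criteria listed above can be sketched as a simple monitoring function. The threshold values below are illustrative assumptions; the paper does not specify exact numbers in this section.

```python
def detect_failure(num_tracked, dp, dyaw_deg, dbias,
                   min_features=10, max_dp=1.0, max_dyaw=20.0, max_dbias=0.5):
    """Return the list of triggered failure criteria (empty list = healthy).

    num_tracked: features tracked in the latest frame
    dp, dyaw_deg: position (m) / yaw (deg) jump between the last two outputs
    dbias: magnitude of the change in bias or extrinsic estimates
    """
    reasons = []
    if num_tracked < min_features:
        reasons.append("too few tracked features")
    if dp > max_dp or dyaw_deg > max_dyaw:
        reasons.append("discontinuity between last two outputs")
    if dbias > max_dbias:
        reasons.append("large bias/extrinsic change")
    return reasons
```

A non-empty return would trigger the switch back to the initialization phase described above.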

VII. Relocalization

VII. 重定位

Our sliding window and marginalization scheme bounds the computation complexity, but it also introduces accumulated drift for the system. To be more specific, drift occurs in the global 3D position \(\left( {\mathrm{x},\mathrm{y},\mathrm{z}}\right)\) and the rotation around the gravity direction (yaw). To eliminate drift, a tightly-coupled relocalization module that seamlessly integrates with the monocular VIO is proposed. The relocalization process starts with a loop detection module that identifies places that have already been visited. Feature-level connections between loop closure candidates and the current frame are then established. These feature correspondences are tightly integrated into the monocular VIO module, resulting in drift-free state estimates with minimum computation overhead. Multiple observations of multiple features are directly used for relocalization, resulting in higher accuracy and better state estimation smoothness. A graphical illustration of the relocalization procedure is shown in Fig. 9(a).

我们的滑动窗口和边缘化方案限制了计算复杂度,但也为系统引入了累积漂移。具体来说,漂移发生在全局3D位置 \(\left( {\mathrm{x},\mathrm{y},\mathrm{z}}\right)\) 以及绕重力方向的旋转(偏航角)上。为了消除漂移,我们提出了一种与单目VIO无缝集成的紧耦合重定位模块。重定位过程从回环检测模块开始,该模块识别已经访问过的地点。然后建立回环候选帧与当前帧之间的特征级连接。这些特征对应关系被紧密集成到单目VIO模块中,以最小的计算开销实现无漂移的状态估计。重定位直接使用多个特征的多次观测,从而获得更高的精度和更好的状态估计平滑性。重定位过程的图示见图9(a)。

A. Loop Detection

A. 回环检测

We utilize DBoW2 [6], a state-of-the-art bag-of-word place recognition approach, for loop detection. In addition to the corner features that are used for the monocular VIO, 500 more corners are detected and described by the BRIEF descriptor [40]. The additional corner features are used to achieve better recall rate on loop detection. Descriptors are treated as the visual word to query the visual database. DBoW2 returns loop closure candidates after temporal and geometrical consistency check. We keep all BRIEF descriptors for feature retrieving, but discard the raw image to reduce memory consumption.

我们使用DBoW2 [6](一种最先进的基于词袋的地点识别方法)进行回环检测。除了用于单目VIO的角点特征外,我们还额外检测500个角点,并用BRIEF描述符 [40] 进行描述。这些额外的角点特征用于提高回环检测的召回率。描述符被视为视觉单词,用于查询视觉数据库。DBoW2在时间和几何一致性检查后返回回环候选帧。我们保留所有BRIEF描述符用于特征检索,但丢弃原始图像以减少内存消耗。

We note that our monocular VIO is able to render roll and pitch angles observable. As such, we do not need to rely on rotation-invariant features, such as the ORB feature used in ORB SLAM [4].

我们注意到我们的单目VIO能够使滚转角和俯仰角可观测。因此,我们不需要依赖旋转不变特征,如ORB SLAM [4] 中使用的ORB特征。

(b) Global Pose Graph optimization

(b) 全局位姿图优化

Fig. 9. A diagram illustrating the relocalization and pose graph optimization procedure. Fig. 9(a) shows the relocalization procedure. It starts with VIO-only pose estimates (blue). Past states are recorded (green). If a loop is detected for the newest keyframe (Sect. VII-A), as shown by the red line in the second plot, a relocalization occurs. Note that due to the use of feature-level correspondences for relocalization, we are able to incorporate loop closure constraints from multiple past keyframes (Sect. VII-C), as indicated in the last three plots. The pose graph optimization is illustrated in Fig. 9(b). A keyframe is added into the pose graph when it is marginalized out from the sliding window. If there is a loop between this keyframe and any other past keyframe, the loop closure constraints, formulated as 4-DOF relative rigid body transforms, will also be added to the pose graph. The pose graph is optimized using all relative pose constraints (Sect. VIII-B) in a separate thread, and the relocalization module always runs with respect to the newest pose graph configuration.

图 9. 重定位和位姿图优化过程的示意图。图 9(a) 显示了重定位过程。它从仅 VIO 的位姿估计(蓝色)开始。过去的状态被记录(绿色)。如果为最新关键帧检测到回环(第 VII-A 节),如第二幅图中的红线所示,则发生重定位。注意,由于重定位使用了特征级对应关系,我们能够整合来自多个过去关键帧的回环约束(第 VII-C 节),如最后三幅图所示。位姿图优化在图 9(b) 中说明。当一个关键帧从滑动窗口中被边缘化时,它会被添加到位姿图中。如果该关键帧与任何其他过去的关键帧之间存在回环,则回环约束(形式化为 4-DOF 相对刚体变换)也会被添加到位姿图中。位姿图使用所有相对位姿约束(第 VIII-B 节)在单独的线程中进行优化,且重定位模块始终相对于最新的位姿图配置运行。

B. Feature Retrieval

B. 特征检索

When a loop is detected, the connection between the local sliding window and the loop closure candidate is established by retrieving feature correspondences. Correspondences are found by BRIEF descriptor matching. Direct descriptor matching may cause a large number of outliers. To this end, we use two-step geometric outlier rejection, as illustrated in Fig. 10:

当检测到回环时,通过检索特征对应关系建立局部滑动窗口与回环候选帧之间的连接。通过 BRIEF 描述符匹配找到对应关系。直接进行描述符匹配可能产生大量外点。为此,我们采用两步几何外点剔除,如图 10 所示:

  • 2D-2D: fundamental matrix test with RANSAC [31]. We use 2D observations of retrieved features in the current image and loop closure candidate image to perform fundamental matrix test.

  • 2D-2D: 使用 RANSAC 进行基础矩阵测试 [31]。我们使用当前图像和回环候选图像中检索到的特征的 2D 观测值来执行基础矩阵测试。

  • 3D-2D: PnP test with RANSAC [35]. Based on the known 3D positions of features in the local sliding window and their 2D observations in the loop closure candidate image, we perform the PnP test.

  • 3D-2D:使用 RANSAC [35] 的 PnP 测试。基于局部滑动窗口中特征的已知三维位置及其在回环候选图像中的二维观测,我们执行 PnP 测试。

Fig. 10. Descriptor matching and outlier removal for feature retrieval during loop closure.

图 10. 回环期间特征检索的描述符匹配与外点移除。

When the number of inliers is beyond a certain threshold, we treat this candidate as a correct loop detection and perform relocalization.

当内点数量超过一定阈值时,我们将这个候选视为正确的回环检测并执行重定位。

C. Tightly-Coupled Relocalization

C. 紧耦合重定位

The relocalization process effectively aligns the current sliding window maintained by the monocular VIO (Sect. VI) to the graph of past poses. During relocalization, we treat the poses of all loop closure frames as constants. We jointly optimize the sliding window using all IMU measurements, local visual measurements, and retrieved feature correspondences from loop closure. We can easily write the visual measurement model for retrieved features observed by a loop closure frame \(v\) to be the same as that for visual measurements in VIO, as shown in (25). The only difference is that the pose \(\left( {{\widehat{\mathbf{q}}}_{v}^{w},{\widehat{\mathbf{p}}}_{v}^{w}}\right)\) of the loop closure frame, which is taken from the pose graph (Sect. VIII), or directly from past odometry output (if this is the first relocalization), is treated as a constant. To this end, we can slightly modify the nonlinear cost function in (22) with additional loop terms:

重定位过程有效地将单目VIO(第VI节)维护的当前滑动窗口与过去位姿的图对齐。在重定位过程中,我们将所有回环帧的位姿视为常数。我们使用所有IMU测量值、局部视觉测量值以及从回环中检索到的特征对应关系来联合优化滑动窗口。我们可以轻松地将回环帧观察到的检索特征的视觉测量模型写成与VIO中的视觉测量相同,如(25)所示。唯一的区别是回环帧的位姿\(\left( {{\widehat{\mathbf{q}}}_{v}^{w},{\widehat{\mathbf{p}}}_{v}^{w}}\right)\),它来自位姿图(第VIII节),或者直接来自过去的里程输出(如果是第一次重定位),被视为常数。为此,我们可以稍微修改(22)中的非线性代价函数,增加额外的回环项:

\[
\begin{aligned}
\mathop{\min }\limits_{\mathcal{X}}\Bigg\{ & {\begin{Vmatrix}{\mathbf{r}}_{p} - {\mathbf{H}}_{p}\mathcal{X}\end{Vmatrix}}^{2} + \mathop{\sum }\limits_{{k \in \mathcal{B}}}{\begin{Vmatrix}{\mathbf{r}}_{\mathcal{B}}\left( {\widehat{\mathbf{z}}}_{{b}_{k + 1}}^{{b}_{k}},\mathcal{X}\right) \end{Vmatrix}}_{{\mathbf{P}}_{{b}_{k + 1}}^{{b}_{k}}}^{2} + \mathop{\sum }\limits_{{\left( {l,j}\right) \in \mathcal{C}}}\rho \left( {\begin{Vmatrix}{\mathbf{r}}_{\mathcal{C}}\left( {\widehat{\mathbf{z}}}_{l}^{{c}_{j}},\mathcal{X}\right) \end{Vmatrix}}_{{\mathbf{P}}_{l}^{{c}_{j}}}^{2}\right) \\
& + \mathop{\sum }\limits_{{\left( {l,v}\right) \in \mathcal{L}}}\rho \left( {\begin{Vmatrix}{\mathbf{r}}_{\mathcal{C}}\left( {\widehat{\mathbf{z}}}_{l}^{v},\mathcal{X},{\widehat{\mathbf{q}}}_{v}^{w},{\widehat{\mathbf{p}}}_{v}^{w}\right) \end{Vmatrix}}_{{\mathbf{P}}_{l}^{{c}_{v}}}^{2}\right) \Bigg\} , 
\end{aligned} \tag{26}
\]

where \(\mathcal{L}\) is the set of observations of retrieved features in the loop closure frames. \(\left( {l,v}\right)\) denotes the \({l}^{th}\) feature observed in the loop closure frame \(v\). Note that although the cost function is slightly different from (22), the dimension of the states to be solved remains the same, as the poses of loop closure frames are considered as constants. When multiple loop closures are established with the current sliding window, we optimize using all loop closure feature correspondences from all frames at the same time. This gives multi-view constraints for relocalization, resulting in higher accuracy and better smoothness. Note that the global optimization of past poses and loop closure frames happens after relocalization, and will be discussed in Sect. VIII.

其中 \(\mathcal{L}\) 是在回环帧中检索到的特征观测的集合。\(\left( {l,v}\right)\) 表示在回环帧 \(v\) 中观测到的第 \({l}^{th}\) 个特征。注意,尽管代价函数与(22)略有不同,但待求解状态的维度保持不变,因为回环帧的位姿被视为常数。当当前滑动窗口建立了多个回环时,我们同时使用来自所有帧的所有回环特征对应进行优化。这为重定位提供了多视角约束,从而获得更高的精度和更好的平滑性。注意,过去位姿和回环帧的全局优化发生在重定位之后,将在第VIII节中讨论。

VIII. Global Pose Graph Optimization

VIII. 全局姿态图优化

After relocalization, the local sliding window shifts and aligns with past poses. Utilizing the relocalization results, this additional pose graph optimization step is developed to ensure the set of past poses are registered into a globally consistent configuration.

重定位后,局部滑动窗口移动并与过去的姿态对齐。利用重定位结果,这个额外的姿态图优化步骤被开发出来,以确保过去的姿态集合被注册到一个全局一致的配置中。

Since our visual-inertial setup renders roll and pitch angles fully observable, the accumulated drift only occurs in four degrees-of-freedom (x, y, z and yaw angle). To this end, we ignore estimating the drift-free roll and pitch states, and only perform 4-DOF pose graph optimization.

由于我们的视觉-惯性设置使得横滚角和俯仰角完全可观测,累积漂移只发生在四个自由度(x, y, z 和偏航角)上。为此,我们忽略了对无漂移横滚角和俯仰角状态的估计,只进行4-DOF姿态图优化。

A. Adding Keyframes into the Pose Graph

A. 将关键帧添加到姿态图中

When a keyframe is marginalized out from the sliding window, it will be added to the pose graph. This keyframe serves as a vertex in the pose graph, and it connects with other vertices by two types of edges:

当一个关键帧从滑动窗口中边缘化出来时,它将被添加到姿态图中。这个关键帧在姿态图中作为一个顶点,它通过两种类型的边与其他顶点相连:

  1. Sequential Edge: a keyframe will establish several sequential edges to its previous keyframes. A sequential edge represents the relative transformation between two keyframes in the local sliding window, whose value is taken directly from VIO. Considering a newly marginalized keyframe \(i\) and one of its previous keyframes \(j\), the sequential edge only contains the relative position \({\widehat{\mathbf{p}}}_{ij}^{i}\) and yaw angle \({\widehat{\psi }}_{ij}\):

顺序边:一个关键帧将与其之前的若干关键帧建立多条顺序边。顺序边表示局部滑动窗口中两个关键帧之间的相对变换,其值直接取自VIO。考虑一个新被边缘化的关键帧 \(i\) 及其之前的某个关键帧 \(j\),顺序边仅包含相对位置 \({\widehat{\mathbf{p}}}_{ij}^{i}\) 和偏航角 \({\widehat{\psi }}_{ij}\):

\[
\begin{aligned}
{\widehat{\mathbf{p}}}_{ij}^{i} &= {\widehat{\mathbf{R}}}_{i}^{{w}^{-1}}\left( {{\widehat{\mathbf{p}}}_{j}^{w} - {\widehat{\mathbf{p}}}_{i}^{w}}\right) \\
{\widehat{\psi }}_{ij} &= {\widehat{\psi }}_{j} - {\widehat{\psi }}_{i}
\end{aligned} \tag{27}
\]

  2. Loop Closure Edge: If the newly marginalized keyframe has a loop connection, it will be connected with the loop closure frame by a loop closure edge in the pose graph. Similarly, the loop closure edge only contains a 4-DOF relative pose transform, defined in the same way as (27). The value of the loop closure edge is obtained using results from relocalization.

回环闭合边:如果新边缘化的关键帧具有回环连接,它将通过位姿图的回环闭合边与回环闭合帧连接。类似地,回环闭合边仅包含4自由度的相对位姿变换,其定义与(27)相同。回环闭合边的值通过重定位的结果获得。
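The 4-DOF edge measurement in (27) can be sketched numerically. In this simplified illustration (not the paper's code) the frame-\(i\) attitude is reduced to its yaw angle, i.e. roll and pitch are assumed zero; the paper uses the full VIO rotation \({\widehat{\mathbf{R}}}_{i}^{w}\).

```python
import math

def relative_edge(p_i, yaw_i, p_j, yaw_j):
    """Return (p_ij expressed in frame i, yaw_ij) from two world-frame poses."""
    dx, dy, dz = (p_j[0] - p_i[0], p_j[1] - p_i[1], p_j[2] - p_i[2])
    c, s = math.cos(yaw_i), math.sin(yaw_i)
    # Rz(yaw_i)^-1 applied to the world-frame translation difference.
    p_ij = [c * dx + s * dy, -s * dx + c * dy, dz]
    yaw_ij = yaw_j - yaw_i
    return p_ij, yaw_ij
```

Both sequential and loop closure edges store exactly this kind of relative position plus relative yaw.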

B. 4-DOF Pose Graph Optimization

B. 4自由度位姿图优化

We define the residual of the edge between frames \(i\) and \(j\) minimally as:

我们将帧 \(i\) 和 \(j\) 之间边的残差最小地定义为:

\[{\mathbf{r}}_{i,j}\left( {{\mathbf{p}}_{i}^{w},{\psi }_{i},{\mathbf{p}}_{j}^{w},{\psi }_{j}}\right) = \left\lbrack \begin{matrix} \mathbf{R}{\left( {\widehat{\phi }}_{i},{\widehat{\theta }}_{i},{\psi }_{i}\right) }^{-1}\left( {{\mathbf{p}}_{j}^{w} - {\mathbf{p}}_{i}^{w}}\right) - {\widehat{\mathbf{p}}}_{ij}^{i} \\ {\psi }_{j} - {\psi }_{i} - {\widehat{\psi }}_{ij} \end{matrix}\right\rbrack , \tag{28} \]

where \({\widehat{\phi }}_{i},{\widehat{\theta }}_{i}\) are the estimated roll and pitch angles, which are obtained directly from the monocular VIO.

其中 \({\widehat{\phi }}_{i},{\widehat{\theta }}_{i}\) 是从单目VIO直接获得的估计滚转角和俯仰角。

The whole graph of sequential edges and loop closure edges is optimized by minimizing the following cost function:

通过最小化以下代价函数来优化顺序边和回环闭合边的整个图:

\[\mathop{\min }\limits_{{\mathbf{p},\psi }}\left\{ {\mathop{\sum }\limits_{{\left( {i,j}\right) \in \mathcal{S}}}{\begin{Vmatrix}{\mathbf{r}}_{i,j}\end{Vmatrix}}^{2} + \mathop{\sum }\limits_{{\left( {i,j}\right) \in \mathcal{L}}}\rho \left( {\begin{Vmatrix}{\mathbf{r}}_{i,j}\end{Vmatrix}}^{2}\right) }\right\} , \tag{29} \]

where \(\mathcal{S}\) is the set of all sequential edges and \(\mathcal{L}\) is the set of all loop closure edges. Although the tightly-coupled relocalization already helps with eliminating wrong loop closures, we add another Huber norm \(\rho \left( \cdot \right)\) to further reduce the impact of any possible wrong loops. In contrast, we do not use any robust norms for sequential edges, as these edges are extracted from VIO, which already contain sufficient outlier rejection mechanisms.

其中 \(\mathcal{S}\) 是所有顺序边的集合,\(\mathcal{L}\) 是所有回环闭合边的集合。尽管紧耦合的重定位已经帮助消除了错误的回环闭合,但我们又添加了一个Huber范数 \(\rho \left( \cdot \right)\) 以进一步减少任何可能的错误回环的影响。相比之下,我们没有为顺序边使用任何鲁棒范数,因为这些边提取自VIO,其中已经包含足够的外点剔除机制。
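The structure of the cost in (29) can be sketched with scalar stand-ins for the residual norms \(\begin{Vmatrix}{\mathbf{r}}_{i,j}\end{Vmatrix}\). The Huber form below is one common definition with an assumed threshold `delta`; the paper does not specify its parameters here.

```python
def huber(s, delta=1.0):
    """One common Huber cost of a residual norm s: quadratic near zero,
    linear beyond delta, which caps the influence of wrong loops."""
    return s * s if s <= delta else 2.0 * delta * s - delta * delta

def pose_graph_cost(seq_residuals, loop_residuals, delta=1.0):
    """Plain squared cost on sequential edges, robust cost on loop edges."""
    return (sum(s * s for s in seq_residuals)
            + sum(huber(s, delta) for s in loop_residuals))
```

Note how a large loop residual grows only linearly in the total cost, while a sequential residual of the same size would grow quadratically.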

Pose graph optimization and relocalization (Sect. VII-C) run asynchronously in two separate threads. This enables immediate use of the most optimized pose graph for relocalization whenever it becomes available. Similarly, even if the current pose graph optimization is not completed yet, relocalization can still take place using the existing pose graph configuration. This process is illustrated in Fig. 9(b).

位姿图优化和重定位(第VII-C节)在两个独立的线程中异步运行。这使得一旦最优化后的位姿图可用,就可以立即用于重定位。同样,即使当前的位姿图优化尚未完成,重定位仍然可以使用现有的位姿图配置进行。这个过程在图9(b)中有所说明。

C. Pose Graph Management

C. 位姿图管理

The size of the pose graph may grow unbounded as the travel distance increases, limiting the real-time performance of the system in the long run. To this end, we implement a downsampling process to keep the pose graph database at a limited size. All keyframes with loop closure constraints will be kept, while other keyframes that are either too close to or have very similar orientations to their neighbors may be removed. The probability of a keyframe being removed is proportional to its spatial density relative to its neighbors.

当移动距离增加时,位姿图的大小可能会无限增长,从而在长期运行中限制了系统的实时性能。为此,我们实现了一个降采样过程,以保持位姿图数据库的有限大小。所有具有回环约束的关键帧将被保留,而其他关键帧,如果与相邻关键帧过于接近或具有非常相似的方向,则可能会被移除。关键帧被移除的概率与其相对于邻居的空间密度成比例。
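The keep/remove policy above can be sketched as follows. This is an illustrative deterministic variant with assumed thresholds (the paper describes a probabilistic removal proportional to spatial density, and gives no exact parameters).

```python
def removal_candidates(keyframes, min_dist=0.5, min_yaw=0.2):
    """Return keyframes that could be downsampled away.

    keyframes: ordered list of dicts with 'p' (x, y, z), 'yaw', 'has_loop'.
    A keyframe is a candidate only if it has no loop constraint and is both
    close in position and similar in heading to its predecessor.
    """
    candidates = []
    for prev, kf in zip(keyframes, keyframes[1:]):
        if kf["has_loop"]:
            continue  # keyframes with loop closure constraints are always kept
        dist = sum((a - b) ** 2 for a, b in zip(kf["p"], prev["p"])) ** 0.5
        if dist < min_dist and abs(kf["yaw"] - prev["yaw"]) < min_yaw:
            candidates.append(kf)
    return candidates
```

In the actual system the decision would be randomized with probability tied to local density, but the keep-loop-keyframes invariant is the same.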

IX. EXPERIMENTAL RESULTS

IX. 实验结果

We perform three experiments and two applications to evaluate the proposed VINS-Mono system. In the first experiment, we compare the proposed algorithm with another state-of-the-art algorithm on public datasets. We perform a numerical analysis to show the accuracy of our system. We then test our system in an indoor environment to evaluate its performance in repetitive scenes. A large-scale experiment is carried out to illustrate the long-term practicability of our system. Additionally, we demonstrate the proposed system in two applications. For the aerial robot application, we use VINS-Mono for position feedback to control a drone to follow a pre-defined trajectory. We then port our approach onto an iOS mobile device and compare against Google Tango.

我们进行了三个实验和两个应用来评估提出的VINS-Mono系统。在第一个实验中,我们在公共数据集上比较了所提出算法和另一种最先进的算法。我们进行了数值分析以展示系统的准确性。然后我们在室内环境中测试我们的系统,以评估在重复场景中的性能。我们还进行了一个大规模实验,以说明我们的系统长时间实用性。此外,我们将提出的系统应用于两个应用。在无人机应用中,我们使用VINS-Mono进行位置反馈,以控制无人机沿预定义轨迹飞行。然后我们将我们的方法移植到iOS移动设备上,并与Google Tango进行比较。

A. Dataset Comparison

A. 数据集比较

We evaluate our proposed VINS-Mono using the EuRoC MAV Visual-Inertial Datasets [41]. The datasets are collected onboard a micro aerial vehicle, which contains stereo images (Aptina MT9V034 global shutter, WVGA monochrome, 20 FPS), synchronized IMU measurements (ADIS16448, 200 \(\mathrm{{Hz}}\) ),and ground truth states (VICON and Leica MS50). We only use images from the left camera. Large IMU bias and illumination changes are observed in these datasets.

我们使用 EuRoC MAV 视觉-惯性数据集 [41] 评估了我们提出的 VINS-Mono。这些数据集是在微型飞行器上收集的,包含立体图像(Aptina MT9V034 全局快门,WVGA 单色,20 FPS),同步的 IMU 测量(ADIS16448,200 \(\mathrm{{Hz}}\))以及地面真实状态(VICON 和 Leica MS50)。我们只使用左侧相机的图像。在这些数据集中观察到了较大的 IMU 偏差和光照变化。

Fig. 11. Trajectory in MH_03_medium, compared with OKVIS.

图 11. MH_03_medium 中的轨迹,与 OKVIS 对比。

Fig. 12. Translation error plot in MH_03_medium.

图 12. MH_03_medium 中的平移误差图。

In these experiments, we compare VINS-Mono with OKVIS [16], a state-of-the-art VIO that works with monocular and stereo cameras. OKVIS is another optimization-based sliding-window algorithm. Our algorithm differs from OKVIS in many details, as presented in the technical sections. Our system is complete with robust initialization and loop closure. We use two sequences, MH_03_medium and MH_05_difficult, to show the performance of the proposed method. To simplify the notation, we use VINS to denote our approach with only monocular VIO, and VINS_loop to denote the complete version with relocalization and pose graph optimization. We use OKVIS_mono and OKVIS_stereo to denote OKVIS's results using monocular and stereo images, respectively. For a fair comparison, we discard the first 100 outputs, use the following 150 outputs to align with the ground truth, and compare the remaining estimator outputs.

在这些实验中,我们将 VINS-Mono 与 OKVIS [16] 进行比较,OKVIS 是一种最先进的、同时支持单目和立体相机的 VIO 方法。OKVIS 是另一种基于优化的滑动窗口算法。我们的算法与 OKVIS 在许多细节上有所不同,具体见技术部分。我们的系统具备鲁棒的初始化和回环功能,是一个完整的系统。我们使用两个序列,MH_03_medium 和 MH_05_difficult,来展示所提方法的表现。为了简化表示,我们用 VINS 表示仅含单目 VIO 的方法,用 VINS_loop 表示带有重定位和位姿图优化的完整版本。我们用 OKVIS_mono 和 OKVIS_stereo 分别表示 OKVIS 使用单目和立体图像的结果。为了公平比较,我们丢弃前 100 个输出,使用接下来的 150 个输出与地面真值对齐,并比较其余的估计器输出。

For the sequence MH_03_medium, the trajectory is shown in Fig. 11. We only compare translation error since rotational motion is negligible in this sequence. The \(\mathrm{x},\mathrm{y},\mathrm{z}\) error versus time, and the translation error versus distance, are shown in Fig. 12. In the error plot, VINS-Mono with loop closure has the smallest translation error. We observe similar results in MH_05_difficult. The proposed method with the loop function has the smallest translation error. The translation and rotation errors are shown in Fig. 14. Since the movement is smooth without much yaw angle change in this sequence, only position drift occurs. Obviously, the loop closure capability efficiently bounds the accumulated drift. OKVIS performs better in roll and pitch angle estimation. A possible reason is that VINS-Mono uses the pre-integration technique, which is a first-order approximation of IMU propagation, to save computation resources.

对于序列 MH_03_medium,轨迹如图11所示。我们只比较平移误差,因为该序列中的旋转运动可以忽略不计。图12显示了 \(\mathrm{x},\mathrm{y},\mathrm{z}\) 误差随时间的变化以及平移误差随距离的变化。在误差图中,带有回环功能的VINS-Mono具有最小的平移误差。我们在MH_05_difficult序列中观察到了类似的结果。带有回环功能的所提方法具有最小的平移误差。平移和旋转误差显示在图14中。由于该序列中的运动平滑且偏航角变化不大,所以只发生了位置漂移。显然,回环能力有效地限制了累积漂移。OKVIS在滚转角和俯仰角估计方面表现更好。一个可能的原因是VINS-Mono使用了预积分技术(IMU传播的一阶近似)以节省计算资源。

Fig. 13. Trajectory in MH_05_difficult, compared with OKVIS.

图13。在MH_05_difficult中的轨迹,与OKVIS的比较。

Fig. 14. Translation error and rotation error plot in MH_05_difficult.

图14。MH_05_difficult中的平移误差和旋转误差图。

VINS-Mono performs well in all EuRoC datasets, even in the most challenging sequence, V1_03_difficult, which includes aggressive motion, texture-less areas, and significant illumination change. The proposed method can initialize quickly in V1_03_difficult, thanks to the dedicated initialization procedure.

VINS-Mono在所有EuRoC数据集中表现良好,即使在最具挑战性的序列V1_03_difficult中,该序列包含了剧烈运动、无纹理区域和显著光照变化。由于专用的初始化程序,提出的方法在V1_03_difficult中可以快速初始化。

For pure VIO, both VINS-Mono and OKVIS have similar accuracy; it is hard to distinguish which one is better. However, VINS-Mono outperforms OKVIS at the system level. It is a complete system with robust initialization and a loop closure function to assist the monocular VIO.

对于纯VIO,VINS-Mono和OKVIS具有相似的精度,很难区分哪个更好。然而,VINS-Mono在系统级别上优于OKVIS。它是一个完整的系统,具有强大的初始化和回环功能来辅助单目VIO。

Fig. 15. The device we used for the indoor experiment. It contains one forward-looking global shutter camera (MatrixVision mvBlueFOX-MLC200w) with \({752} \times {480}\) resolution. We use the built-in IMU (ADXL278 and ADXRS290, \({100}\mathrm{\;{Hz}}\) ) of the DJI A3 flight controller.

图 15. 我们用于室内实验的设备。它包含一个前视全局快门相机(MatrixVision mvBlueFOX-MLC200w),分辨率为 \({752} \times {480}\)。我们使用 DJI A3 飞行控制器内置的 IMU(ADXL278 和 ADXRS290,\({100}\mathrm{\;{Hz}}\))。

Fig. 16. Sample images for the challenging indoor experiment.

图 16. 具有挑战性的室内实验的样本图像。

B. Indoor Experiment

B. 室内实验

In the indoor experiment, we choose our laboratory environment as the experiment area. The sensor suite we use is shown in Fig. 15. It contains a monocular camera \(\left( {{20}\mathrm{\;{Hz}}}\right)\) and an IMU \(\left( {{100}\mathrm{\;{Hz}}}\right)\) inside the DJI A3 controller \({}^{3}\). We hold the sensor suite by hand and walk at a normal pace in the laboratory. We encounter pedestrians, low light conditions, texture-less areas, glass, and reflections, as shown in Fig. 16. Videos can be found in the multimedia attachment.

在室内实验中,我们选择我们的实验室环境作为实验区域。我们使用的传感器套件如图 15 所示,它包含一个单目相机 \(\left( {{20}\mathrm{\;{Hz}}}\right)\) 和 DJI A3 控制器\({}^{3}\)内置的 IMU \(\left( {{100}\mathrm{\;{Hz}}}\right)\)。我们手持传感器套件,在实验室中以正常步伐行走。我们遇到了行人、低光照条件、无纹理区域、玻璃和反射,如图 16 所示。视频可以在多媒体附件中找到。

We compare our result with OKVIS, as shown in Fig. 17. Fig. 17(a) is the VIO output from OKVIS. Fig. 17(b) is the VIO-only result from the proposed method without loop closure. Fig. 17(c) is the result of the proposed method with relocalization and loop closure. Noticeable VIO drift occurred when we circled indoors. Both OKVIS and the VIO-only version of VINS-Mono accumulate significant drift in \(\mathrm{x},\mathrm{y},\mathrm{z}\), and yaw angle. Our relocalization and loop closure modules efficiently eliminate all these drifts.

我们将我们的结果与 OKVIS 进行比较,如图 17 所示。图 17(a) 是 OKVIS 的 VIO 输出。图 17(b) 是提出的方法在没有回环时的 VIO-only 结果。图 17(c) 是提出的方法带有重定位和回环的结果。当我们在室内绕行时,出现了明显的 VIO 漂移。OKVIS 和 VINS-Mono 的 VIO-only 版本在 \(\mathrm{x},\mathrm{y},\mathrm{z}\) 和偏航角上均积累了显著的漂移。我们的重定位和回环模块有效地消除了所有这些漂移。

C. Large-scale Environment

C. 大规模环境

  1. Go out of the lab: We test VINS-Mono in a mixed indoor and outdoor setting. The sensor suite is the same as the one shown in Fig. 15. We started from a seat in the laboratory and went around the indoor space. Then we went down the stairs and walked around the playground outside the building. Next, we went back into the building and went upstairs. Finally, we returned to the same seat in the laboratory. The whole trajectory is more than 700 meters and lasts approximately ten minutes. A video of the experiment can be found in the multimedia attachment.

  1. 离开实验室:我们在室内外混合环境中测试 VINS-Mono。传感器套件与图 15 所示相同。我们从实验室的一个座位出发,绕室内空间走一圈。然后我们下楼,在楼外的操场上走动。接下来,我们回到楼内并上楼。最后,我们返回到实验室的同一座位。整个轨迹超过 700 米,持续大约十分钟。实验视频可以在多媒体附件中找到。

3http://www.dji.com/a3

(c) Trajectory of VINS-Mono with relocalization and loop closure. Red lines indicate loop detection.

(c) 带有重定位和回环的 VINS-Mono 轨迹。红色线条表示回环检测。

Fig. 17. Results of the indoor experiment with comparison against OKVIS.

图 17. 室内实验结果与 OKVIS 对比。

The trajectory is shown in Fig. 19. Fig. 19(a) is the trajectory from OKVIS. When we went up the stairs, OKVIS shows unstable feature tracking, resulting in bad estimation; we cannot see the shape of the stairs in the red block. The VIO-only result from VINS-Mono is shown in Fig. 19(b). The trajectory with loop closure is shown in Fig. 19(c). The shape of the stairs is clear in the proposed method. The closed-loop trajectory is aligned with Google Map to verify its accuracy, as shown in Fig. 18.

轨迹如图 19 所示。图 19(a) 是 OKVIS 的轨迹。当我们上楼时,OKVIS 出现不稳定的特征跟踪,导致估计结果不佳,在红色框中我们无法看到楼梯的形状。VINS-Mono 的仅 VIO 结果如图 19(b) 所示。带有回环的轨迹如图 19(c) 所示。在提出的方法中楼梯的形状清晰可见。回环轨迹与 Google Map 对齐以验证其准确性,如图 18 所示。
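The map alignment step mentioned above amounts to a least-squares rigid registration of the estimated trajectory against reference points. A minimal 2-D sketch (not the paper's implementation; the scale is kept fixed because visual-inertial odometry is already metric):

```python
import numpy as np

def align_yaw_translation(traj_xy, ref_xy):
    """Least-squares 2-D rigid alignment (yaw + translation, no scale)
    between corresponding trajectory and reference points (Kabsch method)."""
    mu_t, mu_r = traj_xy.mean(axis=0), ref_xy.mean(axis=0)
    H = (traj_xy - mu_t).T @ (ref_xy - mu_r)      # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                      # keep a proper rotation (no reflection)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_r - R @ mu_t
    return R, t

# toy check: rotate a unit square by 30 degrees and shift it
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
traj = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
ref = traj @ R_true.T + np.array([5.0, -2.0])
R, t = align_yaw_translation(traj, ref)
print(np.allclose(traj @ R.T + t, ref))   # True
```

Given a few ground-control points read off the map, the recovered rotation and translation place the whole trajectory on the satellite image.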

The final drift for OKVIS is [13.80, -5.26, 7.23] m in the x, y, and z axes. The final drift of VINS-Mono without loop closure is \(\left\lbrack {-{5.47},{2.76}, - {0.29}}\right\rbrack \mathrm{m}\), which is \({0.88}\%\) of the total trajectory length, smaller than the \({2.36}\%\) of OKVIS. With loop correction, the final drift is bounded to \(\left\lbrack {-{0.032},{0.09}, - {0.07}}\right\rbrack \mathrm{m}\), which is trivial compared to the total trajectory length. Although we do not have ground truth, we can still visually inspect that the optimized trajectory is smooth and can be precisely aligned with the satellite map.

OKVIS 的最终漂移在 x、y 和 z 轴上是 [13.80, -5.26, 7.23] 米。没有回环的 VINS-Mono 的最终漂移是 \(\left\lbrack {-{5.47},{2.76}, - {0.29}}\right\rbrack \mathrm{m}\),占整个轨迹长度的 \({0.88}\%\),小于 OKVIS 的 \({2.36}\%\)。经过回环校正后,最终漂移限制在 \(\left\lbrack {-{0.032},{0.09}, - {0.07}}\right\rbrack \mathrm{m}\),与整个轨迹长度相比是微不足道的。尽管我们没有地面真实值,我们仍然可以通过视觉检查确认优化后的轨迹是平滑的,并且可以精确地与卫星地图对齐。
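The drift percentages above are simply the Euclidean norm of the final x/y/z drift divided by the trajectory length. A quick check, using 700 m as the nominal length since the text only states "more than 700 meters":

```python
import numpy as np

def drift_percentage(final_drift_m, trajectory_length_m):
    """Relative drift: Euclidean norm of the final drift over the path length."""
    return 100.0 * np.linalg.norm(final_drift_m) / trajectory_length_m

okvis = drift_percentage([13.80, -5.26, 7.23], 700.0)   # close to the quoted 2.36 %
vins = drift_percentage([-5.47, 2.76, -0.29], 700.0)    # close to the quoted 0.88 %
print(f"OKVIS: {okvis:.2f} %, VINS-Mono (VIO only): {vins:.2f} %")
```

The small discrepancy from the quoted numbers comes from the unknown exact trajectory length.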

  2. Go around campus: This very large-scale dataset that goes around the whole HKUST campus was recorded with a handheld VI-Sensor \({}^{4}\). The dataset covers ground that is around \({710}\mathrm{\;m}\) in length, \({240}\mathrm{\;m}\) in width, and with \({60}\mathrm{\;m}\) in height changes. The total path length is \({5.62}\mathrm{\;{km}}\). The data contain \({25}\mathrm{\;{Hz}}\) images and \({200}\mathrm{\;{Hz}}\) IMU measurements lasting for 1 hour and 34 minutes. It is a very significant experiment to test the stability and durability of VINS-Mono.

  2. 环绕校园:这是一个环绕整个香港科技大学校园的非常大规模的数据集,它是用手持的 VI-Sensor \({}^{4}\) 记录的。该数据集覆盖了长约 \({710}\mathrm{\;m}\)、宽约 \({240}\mathrm{\;m}\)、高度变化为 \({60}\mathrm{\;m}\) 的地面。总路径长度为 \({5.62}\mathrm{\;{km}}\)。数据包含 \({25}\mathrm{\;{Hz}}\) 的图像和 \({200}\mathrm{\;{Hz}}\) 的 IMU 测量,持续 1 小时 34 分钟。这是一个检验 VINS-Mono 稳定性和耐久性的非常重要的实验。

Fig. 18. The estimated trajectory of the mixed indoor and outdoor experiment aligned with Google Map. The yellow line is the final estimated trajectory from VINS-Mono after loop closure. Red lines indicate loop closure.

图 18。混合室内外实验的估计轨迹与谷歌地图对齐。黄色线条是VINS-Mono在回环后得到的最终估计轨迹。红色线条表示回环。

(a) Estimated trajectory from OKVIS (b) Estimated trajectory from VINS-Mono with loop closure disabled

(a) OKVIS的估计轨迹 (b) 禁用回环的VINS-Mono的估计轨迹

(c) Estimated trajectory from VINS-Mono with loop closure

(c) 启用回环的VINS-Mono的估计轨迹

Fig. 19. Estimated trajectories for the mixed indoor and outdoor experiment. In Fig. 19(a), results from OKVIS went bad during tracking loss in the texture-less region (staircase). Figs. 19(b) and 19(c) show results from VINS-Mono without and with loop closure. Red lines indicate loop closures. The spiral-shaped blue line shows the trajectory when going up and down the stairs. We can see that VINS-Mono performs well (subject to acceptable drift) even without loop closure.

图 19。混合室内外实验的估计轨迹。在图 19(a) 中,OKVIS在纹理较少的区域(楼梯)跟踪丢失时结果变差。图 19(b) 和 19(c) 显示了没有和有回环的VINS-Mono的结果。红色线条表示回环。螺旋形的蓝色线条显示了上下楼梯时的轨迹。我们可以看到,即使没有回环,VINS-Mono的表现也很好(受到可接受的漂移影响)。

In this large-scale test, we set the keyframe database size to 2000 in order to provide sufficient loop information and achieve real-time performance. We run this dataset on an Intel i7-4790 CPU running at \({3.60}\mathrm{{GHz}}\). Timing statistics are shown in Table I. The estimated trajectory is aligned with Google Map in Fig. 20. Compared with Google Map, we can see our results are almost drift-free in this very long-duration test.

在这个大规模测试中,我们将关键帧数据库大小设置为 2000,以提供足够的回环信息并实现实时性能。我们在主频 \({3.60}\mathrm{{GHz}}\) 的 Intel i7-4790 CPU 上运行这个数据集。时间统计如表 I 所示。估计的轨迹在图 20 中与谷歌地图对齐。与谷歌地图相比,我们可以看到在这个非常长时间的测试中,我们的结果几乎是无漂移的。
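A bounded keyframe database like the one described above can be sketched with a fixed-capacity container. This is an illustrative sketch under the assumption of a simple oldest-first eviction policy; the actual VINS-Mono implementation may downsample or prune keyframes differently:

```python
from collections import deque

class KeyframeDatabase:
    """Keep at most `maxlen` keyframes for loop detection, so that query
    time and memory stay bounded for real-time operation."""
    def __init__(self, maxlen=2000):
        self.frames = deque(maxlen=maxlen)

    def add(self, keyframe):
        # When the deque is full, the oldest keyframe is dropped automatically.
        self.frames.append(keyframe)

    def __len__(self):
        return len(self.frames)

db = KeyframeDatabase(maxlen=2000)
for i in range(2500):
    db.add({"id": i})
print(len(db), db.frames[0]["id"])   # 2000 500
```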

D. Application I: Feedback Control of an Aerial Robot

D. 应用一:无人机反馈控制

We apply VINS-Mono for autonomous feedback control of an aerial robot, as shown in Fig. 21(a). We use a forward-looking global shutter camera (MatrixVision mvBlueFOX-MLC200w) with \({752} \times {480}\) resolution, equipped with a 190-degree fisheye lens. A DJI A3 flight controller is used both for IMU measurements and for attitude stabilization control. The onboard computation resource is an Intel i7-5500U CPU running at \({3.00}\mathrm{{GHz}}\). The traditional pinhole camera model is not suitable for large-FOV cameras, so we use the MEI [42] model for this camera, calibrated by the toolkit introduced in [43].

我们使用 VINS-Mono 进行无人机的自主反馈控制,如图 21(a) 所示。我们使用一款前视全局快门相机(MatrixVision mvBlueFOX-MLC200w),分辨率为 \({752} \times {480}\),并配备了一个 190 度的鱼眼镜头。使用 DJI A3 飞行控制器进行 IMU 测量和姿态稳定控制。机载计算资源为主频 \({3.00}\mathrm{{GHz}}\) 的 Intel i7-5500U CPU。传统的针孔相机模型不适合大视场相机,因此我们使用 MEI [42] 模型来描述这款相机,并通过 [43] 中介绍的工具包进行了标定。
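The MEI (unified) camera model mentioned above projects a 3-D point onto the unit sphere and then through a pinhole shifted by the mirror parameter xi. A minimal sketch with the distortion terms omitted; the intrinsics used below (gamma1, gamma2, u0, v0) are illustrative values, not the calibrated ones:

```python
import numpy as np

def mei_project(P, xi, gamma1, gamma2, u0, v0):
    """Unified (MEI) projection without distortion: normalize by the
    depth shifted along the ray by xi * ||P||, then apply the intrinsics."""
    X, Y, Z = P
    rho = np.sqrt(X**2 + Y**2 + Z**2)
    x = X / (Z + xi * rho)   # normalized image coordinates
    y = Y / (Z + xi * rho)
    return gamma1 * x + u0, gamma2 * y + v0

# sanity check: with xi = 0 the model reduces to a plain pinhole camera
u, v = mei_project((0.2, -0.1, 1.0), xi=0.0,
                   gamma1=400.0, gamma2=400.0, u0=376.0, v0=240.0)
print(u, v)   # 456.0 200.0
```

A nonzero xi bends the projection toward the image center, which is what lets a single model cover a 190-degree fisheye lens.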

In this experiment, we test the performance of autonomous trajectory tracking using state estimates from VINS-Mono. Loop closure is disabled for this experiment. The quadrotor is commanded to track a figure-eight pattern with each circle being 1.0 meter in radius, as shown in Fig. 21(b). Four obstacles are put around the trajectory to verify the accuracy of VINS-Mono without loop closure. The quadrotor follows this trajectory four times continuously during the experiment. The \({100}\mathrm{\;{Hz}}\) onboard state estimates (Sect. VI-F) enable real-time feedback control of the quadrotor.

在这个实验中,我们测试了使用 VINS-Mono 状态估计进行自主轨迹跟踪的性能。实验中禁用了回环检测。四旋翼被指令跟踪如图 21(b) 所示的“8”字形轨迹,每个圆的半径为 1.0 米。轨迹周围放置了四个障碍物,以验证没有回环检测时 VINS-Mono 的准确性。实验中四旋翼连续四次跟踪此轨迹。\({100}\mathrm{\;{Hz}}\) 的机载状态估计(第 VI-F 节)使四旋翼能够实现实时反馈控制。
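A figure-eight reference made of two tangent circles of radius r can be generated as below. The exact parameterization used in the experiment is not given in the text, so this tracing of the two circles is an assumption for illustration:

```python
import math

def figure_eight(t, r=1.0, period=20.0):
    """Reference position on a figure eight formed by two tangent circles
    of radius r meeting at the origin, traversed once per `period` seconds."""
    w = 2.0 * math.pi * (t % period) / period   # 0..2*pi over one full lap
    if w < math.pi:
        th = 2.0 * w                            # upper circle, centered at (0, r)
        return r * math.sin(th), r * (1.0 - math.cos(th))
    th = 2.0 * (w - math.pi)                    # lower circle, centered at (0, -r)
    return -r * math.sin(th), -r * (1.0 - math.cos(th))

# starts at the crossing point, reaches the top of the upper circle at t = period/4
print(figure_eight(0.0), figure_eight(5.0))
```

Sampling this function at the controller rate yields the position setpoints the quadrotor tracks.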

Ground truth is obtained using OptiTrack \({}^{5}\). The total trajectory length is \({61.97}\mathrm{\;m}\). The final drift is \(\left\lbrack {{0.08},{0.09},{0.13}}\right\rbrack \mathrm{m}\), resulting in \({0.29}\%\) position drift. Details of the translation and rotation as well as their corresponding errors are shown in Fig. 23.

使用 OptiTrack \({}^{5}\) 获取地面真实值。总轨迹长度为 \({61.97}\mathrm{\;m}\)。最终漂移为 \(\left\lbrack {{0.08},{0.09},{0.13}}\right\rbrack \mathrm{m}\),对应 \({0.29}\%\) 的位置漂移。平移和旋转的详细信息以及它们对应的误差如图 23 所示。

E. Application II: Mobile Device

E. 应用二:移动设备

We port VINS-Mono to mobile devices and present a simple AR application to showcase its accuracy and robustness. We name our mobile implementation VINS-Mobile \({}^{6}\) and we compare it against the Google Tango device \({}^{7}\), which is one of the best commercially available augmented reality solutions on mobile platforms.

我们将 VINS-Mono 移植到移动设备上,并展示了一个简单的增强现实应用程序来展示其准确性和鲁棒性。我们将移动设备上的实现命名为 VINS-Mobile \({}^{6}\),并将其与 Google Tango 设备 \({}^{7}\) 进行比较,后者是移动平台上商业可用的最佳增强现实解决方案之一。

VINS-Mono runs on an iPhone 7 Plus. We use \({30}\mathrm{\;{Hz}}\) images with \({640} \times {480}\) resolution captured by the iPhone, and IMU data at \({100}\mathrm{\;{Hz}}\) obtained by the built-in InvenSense MP67B 6-axis gyroscope and accelerometer. As Fig. 24 shows, we mount the iPhone together with a Tango-enabled Lenovo Phab 2 Pro. The Tango device uses a global shutter fisheye camera and a synchronized IMU for state estimation. Firstly, we insert a virtual cube on the plane which is extracted from estimated visual features, as shown in Fig. 25(a). Then we hold these

VINS-Mono 在 iPhone 7 Plus 上运行。我们使用 iPhone 采集的 \({30}\mathrm{\;{Hz}}\)、分辨率为 \({640} \times {480}\) 的图像,以及由内置的 InvenSense MP67B 6 轴陀螺仪和加速度计获得的 \({100}\mathrm{\;{Hz}}\) IMU 数据。如图 24 所示,我们将 iPhone 与支持 Tango 的 Lenovo Phab 2 Pro 安装在一起。Tango 设备使用全局快门鱼眼相机和同步 IMU 进行状态估计。首先,我们在由估计的视觉特征提取出的平面上插入一个虚拟立方体,如图 25(a) 所示。然后我们拿着这

\({}^{5}\) http://optitrack.com/

\({}^{6}\) https://github.com/HKUST-Aerial-Robotics/VINS-Mobile

\({}^{7}\) http://shopap.lenovo.com/hk/en/tango/

two devices and walk inside and outside the room at a normal pace. When loops are detected, we use the 4-DOF pose graph optimization (Sect. VIII-B) to eliminate x, y, z, and yaw drifts.

两个设备,以正常步伐在房间内外走动。当检测到回环时,我们使用 4-DOF 位姿图优化(第 VIII-B 节)来消除 x、y、z 和偏航漂移。
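Each edge in the 4-DOF pose graph (Sect. VIII-B) penalizes the relative translation expressed in frame i and the relative yaw, leaving roll and pitch untouched. A minimal residual sketch, simplified by taking roll and pitch as zero (VINS-Mono actually rotates by the drift-free roll/pitch estimates from VIO):

```python
import numpy as np

def rot_yaw(psi):
    """Rotation about the gravity-aligned z axis by yaw angle psi."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def edge_residual(p_i, psi_i, p_j, psi_j, p_ij_meas, psi_ij_meas):
    """4-DOF edge residual: relative translation in frame i and relative yaw,
    compared against the measured relative pose (p_ij_meas, psi_ij_meas)."""
    r_p = rot_yaw(psi_i).T @ (p_j - p_i) - p_ij_meas
    r_psi = psi_j - psi_i - psi_ij_meas
    return np.append(r_p, r_psi)

# consistency check: poses exactly matching the measurement give zero residual
p_i, psi_i = np.array([1.0, 2.0, 0.5]), 0.3
p_ij, psi_ij = np.array([2.0, 0.0, 0.1]), 0.2
p_j = p_i + rot_yaw(psi_i) @ p_ij
psi_j = psi_i + psi_ij
print(edge_residual(p_i, psi_i, p_j, psi_j, p_ij, psi_ij))
```

Stacking these residuals over sequential and loop edges and minimizing them is what removes the accumulated x, y, z, and yaw drift while preserving the observable roll and pitch.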

TABLE I

TIMING STATISTICS

时间统计

Thread  Modules                  Time (ms)  Rate (Hz)
1       Feature detector         15         25
        KLT tracker              5          25
2       Window optimization      50         10
3       Loop detection           100        -
        Pose graph optimization  130        -

Fig. 20. The estimated trajectory of the very large-scale environment aligned with Google Map. The yellow line is the estimated trajectory from VINS-Mono. Red lines indicate loop closures.

图 20。与谷歌地图对齐的超大规模环境估计轨迹。黄色线是 VINS-Mono 的估计轨迹。红色线表示回环。

(b) Testing environment and desired figure eight pattern

(b) 测试环境和期望的“8”字形模式

Fig. 21. Fig. 21(a): The self-developed aerial robot with a forward-looking fisheye camera (MatrixVision mvBlueFOX-MLC200w, 190° FOV) and a DJI A3 flight controller (ADXL278 and ADXRS290, \({100}\mathrm{\;{Hz}}\)). Fig. 21(b): The designed trajectory. Four known obstacles are placed. The yellow line is the predefined figure-eight pattern which the aerial robot should follow. The robot follows the trajectory four times with loop closure disabled. A video of the experiment can be found in the multimedia attachment.

图 21。图 21(a):自研的无人机,配备前视鱼眼相机(MatrixVision mvBlueFOX-MLC200w,190° 视场)和 DJI A3 飞行控制器(ADXL278 和 ADXRS290,\({100}\mathrm{\;{Hz}}\))。图 21(b):设计的轨迹。放置了四个已知的障碍物。黄色线条是无人机应遵循的预定义“8”字形图案。机器人在禁用回环的情况下沿轨迹飞行四次。实验视频可以在多媒体附件中找到。

Interestingly, when we open a door, Tango's yaw estimate jumps by a large angle, as shown in Fig. 25(b). The reason may be an estimator crash caused by unstable feature tracking, or active failure detection and recovery. However, VINS-Mono still works well in this challenging case. After traveling about \({264}\mathrm{\;m}\), we return to the start location. The final result can be seen in Fig. 25(c): the trajectory of Tango drifts in the last lap while our VINS returns to the start point. The drift in the total trajectory is eliminated by the 4-DOF pose graph optimization. This is also evidenced by the fact that the cube is registered to the same place on the image compared to the beginning.

有趣的是,当我们打开一扇门时,如图 25(b) 所示,Tango 的偏航估计会跳变一个很大的角度。原因可能是不稳定的特征跟踪或主动故障检测与恢复导致的估计器崩溃。然而,VINS-Mono 在这个具有挑战性的情况下仍然表现良好。行进约 \({264}\mathrm{\;m}\) 后,我们返回到起始位置。最终结果可以在图 25(c) 中看到:Tango 的轨迹在最后一圈出现漂移,而我们的 VINS 回到了起点。由于 4-DOF 位姿图优化,总轨迹的漂移被消除了。这也得到了一个事实的证实:与开始时相比,立方体在图像上配准到相同的位置。

Fig. 22. The trajectory of loop closure-disabled VINS-Mono on the MAV platform and its comparison against the ground truth. The robot follows the trajectory four times. VINS-Mono estimates are used as real-time position feedback for the controller. Ground truth is obtained using OptiTrack. Total length is \({61.97}\mathrm{\;m}\) . Final drift is \({0.18}\mathrm{\;m}\) .

图 22。禁用回环的 VINS-Mono 在 MAV 平台上的轨迹及其与真实轨迹的比较。机器人沿轨迹飞行四次。VINS-Mono 的估计值被用作控制器的实时位置反馈。真实轨迹通过 OptiTrack 获取。总长度为 \({61.97}\mathrm{\;m}\)。最终漂移为 \({0.18}\mathrm{\;m}\)。

Admittedly, Tango is more accurate than our implementation, especially for local state estimates. However, this experiment shows that the proposed method can run on general mobile devices and has the potential to compete with specially engineered devices. The robustness of the proposed method is also proven in this experiment. Video can be found in the multimedia attachment.

诚然,Tango 在局部状态估计方面比我们的实现更为精确。然而,这个实验表明,所提出的方法可以在通用移动设备上运行,并且具有与专门设计的设备相竞争的潜力。该方法的鲁棒性也在本实验中得到了证明。视频可以在多媒体附件中找到。

Fig. 23. Position, orientation, and their corresponding errors of loop closure-disabled VINS-Mono compared with OptiTrack. A sudden jump in pitch error at 60 s is caused by aggressive braking at the end of the designed trajectory, and possibly by time misalignment between VINS-Mono and OptiTrack.

图 23. 与 OptiTrack 相比,禁用回环的 VINS-Mono 的位置、方向及其相应误差。在 60 秒处出现的俯仰误差突变是由设计轨迹末端的急刹车,以及 VINS-Mono 与 OptiTrack 之间可能存在的时间不对齐造成的。

Fig. 24. A simple holder that we used to mount the Google Tango device (left) and the iPhone7 Plus (right) that runs our VINS-Mobile implementation.

图 24. 我们用来安装Google Tango设备(左)和运行我们VINS-Mobile实现的iPhone7 Plus(右)的简易支架。

X. Conclusion and Future Work

In this paper, we propose a robust and versatile monocular visual-inertial estimator. Our approach features both state-of-the-art and novel solutions to IMU pre-integration, estimator initialization and failure recovery, online extrinsic calibration, tightly-coupled visual-inertial odometry, relocalization, and efficient global optimization. We show superior performance by comparing against both state-of-the-art open source implementations and highly optimized industry solutions. We open source both PC and iOS implementation for the benefit of the community.

在本文中,我们提出了一种鲁棒且多功能的单目视觉-惯性估计器。我们的方法在IMU预积分、估计器初始化和故障恢复、在线外部校准、紧耦合的视觉-惯性里程计、重定位和高效全局优化方面都采用了最先进和创新的解决方案。通过与最先进的开源实现和高度优化的工业解决方案进行比较,我们展示了卓越的性能。为了社区的益处,我们开源了PC和iOS的实现。

Although feature-based VINS estimators have already reached the maturity of real-world deployment, we still see many directions for future research. Monocular VINS may reach weakly observable or even degenerate conditions depending on the motion and the environment. We are most interested in online methods to evaluate the observability properties of monocular VINS, and online generation of motion plans to restore observability. Another research direction concerns the mass deployment of monocular VINS on a large variety of consumer devices, such as mobile phones. This application requires online calibration of almost all sensor intrinsic and extrinsic parameters, as well as the online identification of calibration qualities. Finally, we are interested in producing dense maps given results from monocular VINS. Our first results on monocular visual-inertial dense mapping with application to drone navigation were presented in [44]. However, extensive research is still necessary to further improve the system accuracy and robustness.

尽管基于特征的VINS估计器已经在实际应用中达到成熟水平,但我们仍然看到许多未来研究的方向。单目VINS可能会根据运动和环境达到观测性较弱甚至退化的条件。我们最感兴趣的是在线评估单目VINS的观测性特性,以及在线生成运动计划以恢复观测性。另一个研究方向涉及在多种消费设备上大规模部署单目VINS,例如手机。这种应用需要在线校准几乎所有传感器的内在和外在参数,以及在线识别校准质量。最后,我们感兴趣的是根据单目VINS的结果生成密集地图。我们关于单目视觉-惯性密集映射的首批成果及其在无人机导航中的应用在[44]中进行了介绍。然而,为了进一步提高系统的准确性和鲁棒性,还需要进行广泛的研究。

Fig. 25. From left to right: AR image from VINS-Mobile, estimated trajectory from VINS-Mono, estimated trajectory from Tango. Fig. 25(a): Both VINS-Mobile and Tango are initialized at the start location and a virtual box is inserted on the plane which is extracted from estimated features. Fig. 25(b): A challenging case in which the camera is facing a moving door. The drift of the Tango trajectory is highlighted. Fig. 25(c): The final trajectory of both VINS-Mobile and Tango. The total length is about \({264}\mathrm{\;m}\).

图 25。从左到右:VINS-Mobile 的 AR 图像、VINS-Mono 的估计轨迹、Tango 的估计轨迹。图 25(a):VINS-Mobile 和 Tango 都在起始位置初始化,并在从估计特征中提取的平面上插入一个虚拟盒子。图 25(b):一个具有挑战性的情况,摄像头正对着一扇移动的门。Tango 轨迹的漂移被突出显示。图 25(c):VINS-Mobile 和 Tango 的最终轨迹。总长度约为 \({264}\mathrm{\;m}\)。

REFERENCES

[1] G. Klein and D. Murray, "Parallel tracking and mapping for small ar workspaces," in Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on. IEEE, 2007, pp. 225-234.

[2] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast semi-direct monocular visual odometry," in Proc. of the IEEE Int. Conf. on Robot. and Autom., Hong Kong, China, May 2014.

[3] J. Engel, T. Schöps, and D. Cremers, "Lsd-slam: Large-scale direct monocular slam," in European Conference on Computer Vision. Springer International Publishing, 2014, pp. 834-849.

[4] R. Mur-Artal, J. Montiel, and J. D. Tardos, "Orb-slam: a versatile and accurate monocular slam system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147-1163, 2015.

[5] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[6] D. Gálvez-López and J. D. Tardós, "Bags of binary words for fast place recognition in image sequences," IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188-1197, October 2012.

[7] S. Shen, N. Michael, and V. Kumar, "Tightly-coupled monocular visual-inertial fusion for autonomous flight of rotorcraft MAVs," in Proc. of the IEEE Int. Conf. on Robot. and Autom., Seattle, WA, May 2015.

[8] Z. Yang and S. Shen, "Monocular visual-inertial state estimation with online initialization and camera-imu extrinsic calibration," IEEE Transactions on Automation Science and Engineering, vol. 14, no. 1, pp. 39-51, 2017.

[9] T. Qin and S. Shen, "Robust initialization of monocular visual-inertial estimation on aerial robots." in Proc. of the IEEE/RSJ Int. Conf. on Intell. Robots and Syst., Vancouver, Canada, 2017, accepted.

[10] P. Li, T. Qin, B. Hu, F. Zhu, and S. Shen, "Monocular visual-inertial state estimation for mobile augmented reality." in Proc. of the IEEE Int. Sym. on Mixed and Augmented Reality, Nantes, France, 2017, accepted.

[11] S. Weiss, M. W. Achtelik, S. Lynen, M. Chli, and R. Siegwart, "Real-time onboard visual-inertial state estimation and self-calibration of mavs in unknown environments," in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 957-964.

[12] S. Lynen, M. W. Achtelik, S. Weiss, M. Chli, and R. Siegwart, "A robust and modular multi-sensor fusion approach applied to mav navigation," in Proc. of the IEEE/RSJ Int. Conf. on Intell. Robots and Syst. IEEE, 2013, pp. 3923-3929.

[13] A. I. Mourikis and S. I. Roumeliotis, "A multi-state constraint Kalman filter for vision-aided inertial navigation," in Proc. of the IEEE Int. Conf. on Robot. and Autom., Roma, Italy, Apr. 2007, pp. 3565-3572.

[14] M. Li and A. Mourikis, "High-precision, consistent EKF-based visual-inertial odometry," Int. J. Robot. Research, vol. 32, no. 6, pp. 690-711, May 2013.

[15] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, "Robust visual inertial odometry using a direct ekf-based approach," in Proc. of the IEEE/RSJ Int. Conf. on Intell. Robots and Syst. IEEE, 2015, pp. 298- 304.

[16] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, "Keyframe-based visual-inertial odometry using nonlinear optimization," Int. J. Robot. Research, vol. 34, no. 3, pp. 314-334, Mar. 2014.

[17] R. Mur-Artal and J. D. Tardos, "Visual-inertial monocular SLAM with map reuse," arXiv preprint arXiv:1610.05949, 2016.

[18] K. Wu, A. Ahmed, G. A. Georgiou, and S. I. Roumeliotis, "A square root inverse filter for efficient vision-aided inertial navigation on mobile devices." in Robotics: Science and Systems, 2015.

[19] M. K. Paul, K. Wu, J. A. Hesch, E. D. Nerurkar, and S. I. Roumeliotis, "A comparative analysis of tightly-coupled monocular, binocular, and stereo vins," in Proc. of the IEEE Int. Conf. on Robot. and Autom., Singapore, May 2017.

[20] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, "isam2: Incremental smoothing and mapping using the bayes tree," The International Journal of Robotics Research, vol. 31, no. 2, pp. 216-235, 2012.

[21] V. Usenko, J. Engel, J. Stückler, and D. Cremers, "Direct visual-inertial odometry with stereo cameras," in Proc. of the IEEE Int. Conf. on Robot. and Autom. IEEE, 2016, pp. 1885-1892.

[22] T. Lupton and S. Sukkarieh, "Visual-inertial-aided navigation for high-dynamic motion in built environments without initial conditions," IEEE Trans. Robot., vol. 28, no. 1, pp. 61-76, Feb. 2012.

[23] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, "IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation," in Proc. of Robot.: Sci. and Syst., Rome, Italy, Jul. 2015.

[24] S. Shen, Y. Mulgaonkar, N. Michael, and V. Kumar, "Initialization-free monocular visual-inertial estimation with application to autonomous MAVs," in Proc. of the Int. Sym. on Exp. Robot., Marrakech, Morocco, Jun. 2014.

[25] A. Martinelli, "Closed-form solution of visual-inertial structure from motion," Int. J. Comput. Vis., vol. 106, no. 2, pp. 138-152, 2014.

[26] J. Kaiser, A. Martinelli, F. Fontana, and D. Scaramuzza, "Simultaneous state initialization and gyroscope bias calibration in visual inertial aided navigation," IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 18-25, 2017.

[27] M. Faessler, F. Fontana, C. Forster, and D. Scaramuzza, "Automatic re-initialization and failure recovery for aggressive flight with a monocular vision-based quadrotor," in Proc. of the IEEE Int. Conf. on Robot. and Autom. IEEE, 2015, pp. 1722-1729.

[28] H. Strasdat, J. Montiel, and A. J. Davison, "Scale drift-aware large scale monocular slam," Robotics: Science and Systems VI, 2010.

[29] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. of the Intl. Joint Conf. on Artificial Intelligence, Vancouver, Canada, Aug. 1981, pp. 24-28.

[30] J. Shi and C. Tomasi, "Good features to track," in Computer Vision and Pattern Recognition, 1994. Proceedings CVPR'94., 1994 IEEE Computer Society Conference on. IEEE, 1994, pp. 593-600.

[31] R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge university press, 2003.

[32] A. Heyden and M. Pollefeys, "Multiple view geometry," Emerging Topics in Computer Vision, 2005.

[33] D. Nistér, "An efficient solution to the five-point relative pose problem," IEEE transactions on pattern analysis and machine intelligence, vol. 26, no. 6, pp. 756-770, 2004.

[34] T. Liu and S. Shen, "Spline-based initialization of monocular visual-inertial state estimators at high altitude," IEEE Robotics and Automation Letters, 2017, accepted.

[35] V. Lepetit, F. Moreno-Noguer, and P. Fua, "Epnp: An accurate o (n) solution to the pnp problem," International journal of computer vision, vol. 81, no. 2, pp. 155-166, 2009.

[36] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, "Bundle adjustment: a modern synthesis," in International workshop on vision algorithms. Springer, 1999, pp. 298-372.

[37] P. Huber, "Robust estimation of a location parameter," Annals of Mathematical Statistics, vol. 35, no. 2, pp. 73-101, 1964.

[38] S. Agarwal, K. Mierle, and Others, "Ceres solver," http://ceres-solver.org

[39] G. Sibley, L. Matthies, and G. Sukhatme, "Sliding window filter with application to planetary landing," J. Field Robot., vol. 27, no. 5, pp. 587-608, Sep. 2010.

[40] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "Brief: Binary robust independent elementary features," Computer Vision-ECCV 2010, pp. 778-792, 2010.

[41] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, "The euroc micro aerial vehicle datasets," The International Journal of Robotics Research, 2016.

[42] C. Mei and P. Rives, "Single view point omnidirectional camera calibration from planar grids," in Robotics and Automation, 2007 IEEE International Conference on. IEEE, 2007, pp. 3945-3950.

[43] L. Heng, B. Li, and M. Pollefeys, "Camodocal: Automatic intrinsic and extrinsic calibration of a rig with multiple generic cameras and odometry," in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, 2013, pp. 1793-1800.

[44] Y. Lin, F. Gao, T. Qin, W. Gao, T. Liu, W. Wu, Z. Yang, and S. Shen, "Autonomous aerial navigation using monocular visual-inertial fusion," J. Field Robot., vol. 00, pp. 1-29, 2017.

posted @ Zenith_Hugh