AirSLAM中英对照
来自:https://github.com/sair-lab/AirSLAM
AirSLAM: An Efficient and Illumination-Robust Point-Line Visual SLAM System
AirSLAM:一种高效且对光照鲁棒的点线视觉SLAM系统
Kuan Xu\({}^{1}\), Yuefan Hao\({}^{2}\), Shenghai Yuan\({}^{1}\), Chen Wang\({}^{2}\), Lihua Xie\({}^{1}\), Fellow, IEEE
Abstract-In this paper, we present an efficient visual SLAM system designed to tackle both short-term and long-term illumination challenges. Our system adopts a hybrid approach that combines deep learning techniques for feature detection and matching with traditional backend optimization methods. Specifically, we propose a unified convolutional neural network (CNN) that simultaneously extracts keypoints and structural lines. These features are then associated, matched, triangulated, and optimized in a coupled manner. Additionally, we introduce a lightweight relocalization pipeline that reuses the built map, where keypoints, lines, and a structure graph are used to match the query frame with the map. To enhance the applicability of the proposed system to real-world robots, we deploy and accelerate the feature detection and matching networks using C++ and NVIDIA TensorRT. Extensive experiments conducted on various datasets demonstrate that our system outperforms other state-of-the-art visual SLAM systems in illumination-challenging environments. Efficiency evaluations show that our system can run at a rate of \({73}\mathrm{\;{Hz}}\) on a PC and \({40}\mathrm{\;{Hz}}\) on an embedded platform.
摘要——本文提出了一种高效的视觉SLAM系统,旨在应对短期和长期光照挑战。我们的系统采用了一种混合方法,结合了深度学习技术进行特征检测和匹配以及传统的后端优化方法。
- 具体而言,我们提出了一种统一的卷积神经网络(CNN),同时提取关键点和结构线。这些特征随后以耦合的方式进行关联、匹配、三角化和优化。
- 此外,我们引入了一个轻量级的重定位流程,该流程重用了构建的地图,其中关键点、线条和结构图用于将查询帧与地图匹配。
- 为了增强所提出系统在真实世界机器人中的适用性,我们使用C++和NVIDIA TensorRT部署并加速了特征检测和匹配网络。
- 在各种数据集上进行的广泛实验表明,我们的系统在光照挑战环境中优于其他最先进的视觉SLAM系统。
- 效率评估显示,我们的系统在PC上可以以 \({73}\mathrm{\;{Hz}}\) 的速率运行,在嵌入式平台上可以以 \({40}\mathrm{\;{Hz}}\) 的速率运行。
Index Terms-Visual SLAM, Mapping, Relocalization.
索引词——视觉SLAM,地图构建,重定位。
I. 引言
Visual simultaneous localization and mapping (vSLAM) is essential for robot navigation due to its favorable balance between cost and accuracy [1]. Compared to LiDAR SLAM, vSLAM utilizes more cost-effective and compact sensors to achieve accurate localization, thus broadening its range of potential applications. Moreover, cameras can capture richer and more detailed information, which enhances their potential for providing robust localization.
视觉同时定位与地图构建(vSLAM)对于机器人导航至关重要,因为它在成本和精度之间取得了良好的平衡 [1]。与激光雷达SLAM相比,vSLAM利用成本更低且更紧凑的传感器实现精确的定位,从而扩大了其潜在应用范围。此外,相机能够捕捉更丰富和更详细的信息,增强了其提供稳健定位的潜力。
Despite the recent advancements, the present vSLAM systems still struggle with severe lighting conditions [2]-[5], which can be summarized into two categories. First, feature detection and tracking often fail due to drastic changes or low light, severely affecting the quality of the estimated trajectory [6], [7]. Second, when the visual map is reused for relocalization, lighting variations could significantly reduce the success rate [8], [9]. In this paper, we refer to the first issue as the short-term illumination challenge, which impacts pose estimation between two temporally adjacent frames, and the second as the long-term illumination challenge, which affects matching between the query frame and an existing map.
尽管近期有所进展,当前的vSLAM系统在极端光照条件下仍面临挑战[2]-[5],这些挑战可以归纳为两大类。
- 首先,由于剧烈变化或低光照,特征检测和跟踪常常失败,严重影响了估计轨迹的质量[6],[7]。
- 其次,当视觉地图被重用于重定位时,光照变化可能会显著降低成功率[8],[9]。
在本文中,我们将第一个问题称为短期光照挑战,它影响两个时间相邻帧之间的姿态估计,而第二个问题称为长期光照挑战,它影响查询帧与现有地图之间的匹配。
Present methods usually focus on only one of the above challenges. For example, various image enhancement [10]- [12] and image normalization algorithms [13], [14] have been developed to ensure robust tracking. These methods primarily focus on maintaining either global or local brightness consistency, yet they often fall short of handling all types of challenging lighting conditions [15]. Some systems have addressed this issue by training a VO or SLAM network on large datasets containing diverse lighting conditions [16]-[18]. However, they have difficulty producing a map suitable for long-term localization. Some methods can provide illumination-robust relocalization, but they usually require map building under good lighting conditions [19], [20]. In real-world robot applications, these two challenges often arise simultaneously, necessitating a unified system capable of addressing both.
现有方法通常只关注上述挑战之一。
- 例如,已经开发了各种图像增强[10]-[12]和图像归一化算法[13],[14]以确保鲁棒跟踪。这些方法主要侧重于保持全局或局部亮度一致性,但往往难以应对所有类型的挑战性光照条件[15]。
- 一些系统通过在包含多样光照条件的大型数据集上训练VO或SLAM网络来解决这一问题[16]-[18]。然而,它们难以生成适合长期定位的地图。
- 一些方法可以提供光照鲁棒的重定位,但通常需要在良好光照条件下构建地图[19],[20]。
在实际机器人应用中,这两个挑战常常同时出现,需要一个能够同时解决两者的统一系统。
Furthermore, many of the aforementioned systems incorporate intricate neural networks, relying on powerful GPUs to run in real time. They lack the efficiency necessary for deployment on resource-constrained platforms, such as warehouse robots. These limitations impede the transition of vSLAM from laboratory research to industrial applications.
此外,许多上述系统采用了复杂的神经网络,依赖强大的GPU来实时运行。它们缺乏在资源受限平台(如仓库机器人)上部署所需的效率。这些限制阻碍了vSLAM从实验室研究向工业应用的转变。
In response to these gaps, this paper introduces AirSLAM. Observing that line features can improve the accuracy and robustness of vSLAM systems [4], [21], [22], we integrate both point and line features for tracking, mapping, optimization, and relocalization. To achieve a balance between efficiency and performance, we design our system as a hybrid system, employing learning-based methods for feature detection and matching, and traditional geometric approaches for pose and map optimization. Additionally, to enhance the efficiency of feature detection, we developed a unified model capable of simultaneously detecting point and line features. We also address long-term localization challenges by proposing a multistage relocalization strategy, which effectively reuses our point-line map. In summary, our contributions include
针对这些空白,本文引入了AirSLAM。
- 观察到线特征可以提高vSLAM系统的准确性和鲁棒性[4]、[21]、[22],我们将点特征和线特征集成用于跟踪、建图、优化和重定位。
- 为了在效率和性能之间取得平衡,我们将系统设计为混合系统,采用基于学习的方法进行特征检测和匹配,以及传统的几何方法进行姿态和地图优化。
- 此外,为了提高特征检测的效率,我们开发了一个能够同时检测点特征和线特征的统一模型。
- 我们还通过提出一种多阶段重定位策略来解决长期定位挑战,该策略有效地重用了我们的点-线地图。
总之,我们的贡献包括
- We propose a novel point-line-based vSLAM system that combines the efficiency of traditional optimization techniques with the robustness of learning-based methods. Our system is resilient to both short-term and long-term illumination challenges while remaining efficient enough for deployment on embedded platforms.
- 我们提出了一种新颖的基于点-线特征的vSLAM系统，该系统结合了传统优化技术的效率和基于学习方法的鲁棒性。我们的系统能够抵御短期和长期的光照挑战，同时保持足够的效率以部署在嵌入式平台上。
- We have developed a unified model for both keypoint and line detection, which we call PLNet. To our knowledge, PLNet is the first model capable of simultaneously detecting both point and line features. Furthermore, we associate these two types of features and jointly utilize them for tracking, mapping, and relocalization tasks.
- 我们开发了一个用于关键点和线检测的统一模型，我们称之为PLNet。据我们所知，PLNet是首个能够同时检测点特征和线特征的模型。此外，我们将这两种特征关联起来，并联合利用它们进行跟踪、建图和重定位任务。
- We propose a multi-stage relocalization method based on both point and line features, utilizing both appearance and geometry information. This method can provide fast and illumination-robust localization in an existing visual map using only a single image.
- 我们提出了一种基于点特征和线特征的多阶段重定位方法，利用外观和几何信息。该方法可以在现有视觉地图中仅使用单张图像提供快速且对光照鲁棒的定位。
- We conduct extensive experiments to demonstrate the efficiency and effectiveness of the proposed methods. The results show that our system achieves accurate and robust mapping and relocalization performance under various illumination-challenging conditions. Additionally, our system is also very efficient. It runs at a rate of \({73}\mathrm{\;{Hz}}\) on a PC and \({40}\mathrm{\;{Hz}}\) on an embedded platform.
- 我们进行了广泛的实验以证明所提出方法的效率和有效性。结果显示，我们的系统在各种光照挑战条件下实现了精确且鲁棒的建图和重定位性能。此外，我们的系统也非常高效。它在PC上以\({73}\mathrm{\;{Hz}}\)的速率运行，在嵌入式平台上以\({40}\mathrm{\;{Hz}}\)的速率运行。
- In addition, our engineering contributions include deploying and accelerating feature detection and matching networks using C++ and NVIDIA TensorRT, facilitating their deployment on real robots. We release all the C++ source code at https://github.com/sair-lab/AirSLAM to benefit the community.
- 此外，我们的工程贡献包括使用 C++ 和 NVIDIA TensorRT 部署和加速特征检测与匹配网络，便于它们在实际机器人上的部署。我们在 https://github.com/sair-lab/AirSLAM 上发布了所有 C++ 源代码，以造福社区。
This work is supported by the National Research Foundation of Singapore under its Medium-Sized Center for Advanced Robotics Technology Innovation.
本工作得到了新加坡国家研究基金会下属的先进机器人技术创新中型中心的支持。
\({}^{1}\) Kuan Xu,Shenghai Yuan,and Lihua Xie are with the Centre for Advanced Robotics Technology Innovation (CARTIN), School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, {kuan.xu, shyuan, elhxie}@ntu.edu.sg.
\({}^{1}\) 徐宽、袁慎海和谢立华来自南洋理工大学电气与电子工程学院的高级机器人技术创新中心(CARTIN),地址:新加坡南洋大道50号,邮编639798,电子邮箱:{kuan.xu, shyuan, elhxie}@ntu.edu.sg。
\({}^{2}\) Yuefan Hao and Chen Wang are with Spatial AI &Robotics Lab,Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY 14260, yuefan.hao@outlook.com, chenw@sairlab.org.
\({}^{2}\) 郝跃凡和王晨在布法罗大学计算机科学与工程系的Spatial AI & Robotics实验室工作,地址:纽约州布法罗市,邮编14260,电子邮箱:yuefan.hao@outlook.com, chenw@sairlab.org。
This paper extends our conference paper, AirVO [21]. AirVO utilizes SuperPoint [23] and LSD [24] for feature detection, and SuperGlue [25] for feature matching. It achieves remarkable performance in environments with changing illumination. However, as a visual-only odometry, it primarily addresses short-term illumination challenges and cannot reuse a map for drift-free relocalization. Additionally, despite carefully designed post-processing operations, the modified LSD is still not stable enough for long-term localization. It relies on image gradient information rather than environmental structural information, rendering it susceptible to varying lighting conditions. In this version, we introduce substantial improvements, including:
本文扩展了我们的会议论文,AirVO [21]。
- AirVO 利用 SuperPoint [23] 和 LSD [24] 进行特征检测,并使用 SuperGlue [25] 进行特征匹配。它在光照变化的环境中取得了显著的性能。
- 然而,作为一个仅视觉的里程计,它主要解决短期光照挑战,并且无法重用地图进行无漂移的重定位。
- 此外,尽管经过精心设计的后处理操作,修改后的 LSD 仍然不够稳定,无法用于长期定位。它依赖于图像梯度信息而不是环境结构信息,因此容易受到光照条件变化的影响。
在这个版本中,我们引入了大量改进,包括:
- We design a unified CNN to detect both point and line features, enhancing the stability of feature detection in illumination-challenging environments. Additionally, the more efficient LightGlue [26] is used for feature matching.
- 我们设计了一个统一的 CNN 来检测点和线特征，增强了在光照挑战环境中的特征检测稳定性。此外，更高效的 LightGlue [26] 用于特征匹配。
- We extend our system to support both stereo data and stereo-inertial data, increasing its reliability when an inertial measurement unit (IMU) is available.
- 我们将系统扩展为支持立体数据和立体-惯性数据，当有惯性测量单元（IMU）可用时，提高了其可靠性。
- We incorporate loop closure detection and map optimization, forming a complete vSLAM system.
- 我们加入了回环检测和地图优化，形成了一个完整的 vSLAM 系统。
- We design a multi-stage relocalization module based on both point and line features, enabling our system to effectively handle long-term illumination challenges.
- 我们设计了一个基于点和线特征的多阶段重定位模块，使我们的系统能够有效应对长期光照挑战。
The remainder of this article is organized as follows. In Section II, we discuss the relevant literature. In Section III, we give an overview of the complete system pipeline. The proposed PLNet is presented in Section IV. In Section V, we introduce the visual-inertial odometry based on PLNet. In Section VI, we present how to optimize the map offline and reuse it online. The detailed experimental results are presented in Section VII to verify the efficiency, accuracy, and robustness of AirSLAM. This article is concluded with limitations in Section VIII.
本文的其余部分组织如下。
- 在第二节中,讨论了相关文献。
- 在第三节中,概述了完整的系统流程。
- 在第四节中,介绍了提出的 PLNet。
- 在第五节中,介绍了基于 PLNet 的视觉惯性里程计。
- 在第六节中,展示了如何离线优化地图并在线重用。
- 在第七节中,详细展示了实验结果,以验证 AirSLAM 的效率、准确性和鲁棒性。
- 在第八节中,总结了局限性。
II. 相关工作
A. Keypoint and Line Detection for vSLAM vSLAM 中的关键点检测和线检测
- Keypoint Detection: Various handcrafted keypoint features e.g., ORB [27], FAST [28], and BRISK [29], have been proposed and applied to VO and vSLAM systems. They are usually efficient but not robust enough in challenging environments [8], [25]. With the development of deep learning techniques, more and more learning-based features are proposed and used to replace the handcrafted features in vSLAM systems. Rong et al. [30] introduce TFeat network [31] to extract descriptors for FAST corners and apply it to a traditional vSLAM pipeline. Tang et al. [32] use a neural network to extract robust keypoints and binary feature descriptors with the same shape as the ORB. Han et al. [33] combine SuperPoint [23] feature extractor with a traditional back-end. Bruno et al. proposed LIFT-SLAM [34], where they use LIFT [35] to extract features. Li et al. [36] replace the ORB feature with SuperPoint in ORB-SLAM2 and optimize the feature extraction with the Intel OpenVINO toolkit. Some other learning-based features, e.g., R2D2 [37] and DISK [38], and D2Net [39], are also being attempted to be applied to vSLAM systems, although they are not yet efficient enough [40], [41].
- 关键点检测:
- 已经提出了多种手工制作的关键点特征,例如 ORB [27]、FAST [28] 和 BRISK [29],并应用于视觉里程计(VO)和视觉同时定位与地图构建(vSLAM)系统中。它们通常效率较高,但在具有挑战性的环境中不够鲁棒 [8],[25]。
- 随着深度学习技术的发展,越来越多的基于学习的特征被提出并用于替代 vSLAM 系统中的手工特征。
- Rong 等人 [30] 引入了 TFeat 网络 [31] 来提取 FAST 角点的描述符,并将其应用于传统的 vSLAM 流程中。
- Tang 等人 [32] 使用神经网络提取鲁棒的关键点和与 ORB 形状相同的二进制特征描述符。
- Han 等人 [33] 将 SuperPoint [23] 特征提取器与传统后端相结合。
- Bruno 等人提出了 LIFT-SLAM [34],其中他们使用 LIFT [35] 来提取特征。
- Li 等人 [36] 在 ORB-SLAM2 中用 SuperPoint 替换 ORB 特征,并使用 Intel OpenVINO 工具包优化特征提取。
- 其他一些基于学习的特征,例如 R2D2 [37] 和 DISK [38],以及 D2Net [39],也正在尝试应用于 vSLAM 系统,尽管它们的效率还不够高 [40],[41]。
- Line Detection: Currently, most point-line-based vSLAM systems use the LSD [24] or EDLines [42] to detect line features because of their good efficiency [4], [43]-[46]. Although many learning-based line detection methods, e.g., LCNN [47], SOLD2 [48], and HAWP [49], have been proposed and shown better robustness in challenging environments, they are difficult to apply to real-time vSLAM systems due to lacking efficiency. For example, Kannapiran et al. propose StereoVO [22], where they choose SuperPoint [23] and SOLD2 [48] to detect keypoints and line segments, respectively. Despite achieving good performance in dynamic lighting conditions, StereoVO can only run at a rate of about \(7\mathrm{\;{Hz}}\) on a good GPU.
- 直线检测:
- 目前，大多数基于点-线的vSLAM系统使用LSD [24]或EDLines [42]来检测直线特征，因为它们具有良好的效率 [4], [43]-[46]。
- 尽管许多基于学习的直线检测方法，例如LCNN [47]、SOLD2 [48]和HAWP [49]，已经被提出并在具有挑战性的环境中显示出更好的鲁棒性，但由于效率不足，它们很难应用于实时vSLAM系统。
- 例如,Kannapiran等人提出了StereoVO [22],其中他们分别选择SuperPoint [23]和SOLD2 [48]来检测关键点和线段。尽管在动态光照条件下取得了良好的性能,StereoVO在良好GPU上的运行速率仅为约\(7\mathrm{\;{Hz}}\)。
B. Short-Term Illumination Challenge 短期光照挑战
Several handcrafted methods have been proposed to improve the robustness of VO and vSLAM to challenging illumination. DSO [50] models brightness changes and jointly optimizes camera poses and photometric parameters. DRMS [10] and AFE-ORB-SLAM [11] utilize various image enhancements. Some systems try different methods, such as ZNCC, the locally-scaled sum of squared differences (LSSD), and dense descriptor computation, to achieve robust tracking [13], [14], [51]. These methods mainly focus on either global or local illumination change for all kinds of images, however, lighting conditions often affect the scene differently in different areas [15]. Other related methods include that of Huang and Liu [52], which presents a multi-feature extraction algorithm to extract two kinds of image features when a single-feature algorithm fails to extract enough feature points. Kim et al. [53] employ a patch-based affine illumination model during direct motion estimation. Chen et al. [54] minimize the normalized information distance with nonlinear least square optimization for image registration. Alismail et al. [55] propose a binary feature descriptor using a descriptor assumption to avoid brightness constancy.
已经提出了几种手工制作的方法来提高VO和vSLAM在挑战性光照条件下的鲁棒性。
- DSO [50] 对亮度变化进行建模，并联合优化相机姿态和光度参数。
- DRMS [10] 和 AFE-ORB-SLAM [11] 利用各种图像增强技术。
- 一些系统尝试不同的方法,如ZNCC、局部缩放的平方差和(LSSD)以及密集描述符计算,以实现鲁棒跟踪 [13], [14], [51]。这些方法主要关注全局或局部光照变化对所有类型图像的影响,然而,光照条件通常在场景的不同区域产生不同的影响 [15]。
- 其他相关方法包括Huang和Liu [52] 提出的多特征提取算法,该算法在单特征算法无法提取足够特征点时提取两种图像特征。
- Kim等人 [53] 在直接运动估计过程中采用基于块的仿射光照模型。
- Chen等人 [54] 通过非线性最小二乘优化最小化归一化信息距离进行图像配准。
- Alismail等人 [55] 提出了一种使用描述符假设避免亮度恒定性的二进制特征描述符。
Compared with handcrafted methods, learning-based methods have shown better performance. Savinykh et al. [7] propose DarkSLAM, where Generative Adversarial Network (GAN) [56] is used to enhance input images. Pratap Singh et al. [57] compare different learning-based image enhancement methods for vSLAM in low-light environments. TartanVO [16], DROID-SLAM [17], and iSLAM [18] train their VO or SLAM networks on the TartanAir dataset [58], which is a large simulation dataset that contains various lighting conditions, therefore, they are very robust in challenging environments. However, they usually require good GPUs and long training times. Besides, DROID-SLAM runs very slowly and is difficult to apply to real-time applications on resource-constrained platforms. TartanVO and iSLAM are more efficient, but they cannot achieve performance as accurately as traditional vSLAM systems.
与手工制作的方法相比,基于学习的方法已显示出更好的性能。
- Savinykh等人[7]提出了DarkSLAM,其中生成对抗网络(GAN)[56]用于增强输入图像。
- Pratap Singh等人[57]比较了在低光环境下用于vSLAM的不同基于学习的图像增强方法。
- TartanVO[16]、DROID-SLAM[17]和iSLAM[18]在TartanAir数据集[58]上训练其VO或SLAM网络，该数据集是一个包含各种光照条件的大型仿真数据集，因此它们在挑战性环境中非常鲁棒。
- 然而,它们通常需要良好的GPU和长时间的训练。
- 此外,DROID-SLAM运行非常缓慢,难以应用于资源受限平台上的实时应用。
- TartanVO和iSLAM效率更高,但它们无法达到与传统vSLAM系统一样准确的性能。

Fig. 1. The proposed system consists of three main parts: online stereo VO/VIO, offline map optimization, and online relocalization. The VO/VIO module uses the mapping image sequences to build an initial map. Then the initial map is processed offline and an optimized map is outputted. The optimized map can be used for the one-shot relocalization.
图1. 所提出的系统由三个主要部分组成：在线立体VO/VIO、离线地图优化和在线重定位。VO/VIO模块使用建图图像序列构建初始地图。然后对初始地图进行离线处理并输出优化后的地图。优化后的地图可用于单帧（one-shot）重定位。
C. Long-Term Illumination Challenge 长期光照挑战
Currently, most SLAM systems still use the bag of words (BoW) [59] for loop closure detection and relocalization due to its good balance between efficiency and effectiveness [3], [60], [61]. To make the relocalization more robust to large illumination variations, Labbé et al. [19] propose the multi-session relocalization method, where they combine multiple maps generated at different times and in various illumination conditions. DXSLAM [36] trains a vocabulary for SuperPoint and uses both BoW and NetVLAD [62] for relocalization.
- 目前,大多数SLAM系统仍然使用词袋(BoW)[59]进行回环检测和重定位,因为它在效率和有效性之间有很好的平衡[3]、[60]、[61]。
- 为了使重定位对大的光照变化更加鲁棒,Labbé等人[19]提出了多会话重定位方法,他们结合了在不同时间和各种光照条件下生成的多个地图。
- DXSLAM[36]为SuperPoint训练了一个词汇表,并使用BoW和NetVLAD[62]进行重定位。
Another similar task in the robotics and computer vision communities is the visual place recognition (VPR) problem, where many researchers handle the localization problem with image retrieval methods [62], [63]. These VPR solutions try to find images most similar to the query image from a database. They usually cannot directly provide accurate pose estimation which is needed in robot applications. Sarlin et al. address this and propose Hloc [8]. They use a global retrieval to obtain several candidates and match local features within those candidates. The Hloc toolbox has integrated many image retrieval methods, local feature extractors, and matching methods, and it is currently the SOTA system. Yan et al. [64] propose a long-term visual localization method for mobile platforms, however, they rely on other sensors, e.g., GPS, compass, and gravity sensor, for the coarse location retrieval.
机器人和计算机视觉领域的另一个类似任务是视觉地点识别(VPR)问题,许多研究人员使用图像检索方法处理定位问题[62],[63]。这些VPR解决方案试图从数据库中找到与查询图像最相似的图像。它们通常无法直接提供机器人应用所需的精确姿态估计。
- Sarlin等人解决了这一问题,并提出了Hloc[8]。他们使用全局检索来获取多个候选对象,并在这些候选对象中匹配局部特征。Hloc工具箱集成了许多图像检索方法、局部特征提取器和匹配方法,目前它是SOTA系统。
- Yan等人[64]提出了一种用于移动平台的长时视觉定位方法,然而,他们依赖其他传感器,例如GPS、指南针和重力传感器,进行粗略位置检索。
III. System Overview
III. 系统概述
We believe that a practical vSLAM system should possess the following features:
- High efficiency. The system should have real-time performance on resource-constrained platforms.
- Scalability. The system should be easily extensible for various purposes and real-world applications.
- Easy to deploy. The system should be easy to deploy on real robots and capable of achieving robust localization.
我们认为一个实用的vSLAM系统应具备以下特点:
- 高效率。系统应在资源受限的平台上实现实时性能。
- 可扩展性。系统应易于扩展以适应各种目的和实际应用。
- 易于部署。系统应易于在真实机器人上部署,并能够实现稳健的定位。

Fig. 2. We visualize the feature map (top right) and detected keypoints (bottom left) of a keypoint detection model, and the detected structural lines (bottom right) of a line detection model. The overlap of keypoints and junctions, and the edge information in the feature map inspire the design of our PLNet.
图2. 我们可视化了关键点检测模型的特征图(右上)和检测到的关键点(左下),以及线检测模型检测到的结构线(右下)。关键点和交点的重叠,以及特征图中的边缘信息启发了我们PLNet的设计。
Therefore, we design a system as shown in Fig. 1. The proposed system is a hybrid system as we need the robustness of data-driven approaches and the accuracy of geometric methods. It consists of three main components: stereo VO/VIO, offline map optimization, and lightweight relocalization. (1) Stereo VO/VIO: We propose a point-line-based visual odometry that can handle both stereo and stereo-inertial inputs. (2) Offline map optimization: We implement several commonly used plugins, such as loop detection, pose graph optimization, and global bundle adjustment. The system is easily extensible for other map-processing purposes by adding customized plugins. For example, we have implemented a plugin to train a scene-dependent junction vocabulary using the endpoints of line features, which is utilized in our lightweight multistage relocalization. (3) Lightweight relocalization: We propose a multi-stage relocalization method that improves efficiency while maintaining effectiveness. In the first stage, keypoints and line features are detected using the proposed PLNet, and several candidates are retrieved using a keypoint vocabulary trained on a large dataset. In the second stage, most false candidates are quickly filtered out using a scene-dependent junction vocabulary and a structure graph. In the third stage, feature matching is performed between the query frame and the remaining candidates to find the best match and estimate the pose of the query frame. Since feature matching in the third stage is typically time-consuming, the filtering process in the second stage enhances the efficiency of our system compared to other two-stage relocalization systems.
因此,我们设计了一个如图1所示的系统。
- 所提出的系统是一个混合系统，因为我们既需要数据驱动方法的鲁棒性，也需要几何方法的准确性。
- 它由三个主要组件组成：立体视觉里程计/视觉惯性里程计（VO/VIO）、离线地图优化和轻量级重定位。
- (1)立体视觉里程计/视觉惯性里程计(VO/VIO):我们提出了一种基于点线特征的视觉里程计,能够处理立体和立体惯性输入。
- (2)离线地图优化：我们实现了几种常用的插件，如回环检测、位姿图优化和全局束调整。该系统通过添加定制插件易于扩展以实现其他地图处理目的。例如，我们实现了一个插件，使用线特征的端点训练场景依赖的交点（junction）词汇，这在我们的轻量级多阶段重定位中得到应用。
- (3)轻量级重定位:我们提出了一种多阶段重定位方法,该方法在保持有效性的同时提高了效率。
- 在第一阶段，使用提出的PLNet检测关键点和线特征，并使用在大数据集上训练的关键点词汇检索多个候选。
- 在第二阶段，使用场景依赖的交点词汇和结构图快速过滤掉大部分错误候选。
- 在第三阶段，在查询帧和剩余候选之间进行特征匹配，以找到最佳匹配并估计查询帧的位姿。由于第三阶段的特征匹配通常耗时较长，第二阶段的过滤过程使我们的系统比其他两阶段重定位系统更高效。
We transfer some time-consuming processes, e.g., loop closure detection, pose graph optimization, and global bundle adjustment, to the offline stage. This improves the efficiency of our online mapping module. In many practical applications, such as warehouse robotics, a map is typically built by one robot and then reused by others. Our system is designed with these applications in mind. The lightweight mapping and map reuse modules can be easily deployed on resource-constrained robots, while the offline optimization module can run on a more powerful computer for various map manipulations, such as map editing and visualization. The mapping robot uploads the initial map to the computer, which then distributes the optimized map to other robots, ensuring drift-free relocalization. In the following sections, we introduce our feature detection and visual odometry (VO) pipeline in Section IV and Section V, respectively. The offline optimization and relocalization modules are presented in Section VI.
我们将一些耗时的过程,例如回环检测、位姿图优化和全局束调整,转移到离线阶段。这提高了我们在线建图模块的效率。在许多实际应用中,如仓库机器人,地图通常由一个机器人构建,然后被其他机器人重复使用。我们的系统正是针对这些应用设计的。轻量级建图和地图重用模块可以轻松部署在资源受限的机器人上,而离线优化模块可以在更强大的计算机上运行,进行各种地图操作,如地图编辑和可视化。建图机器人将初始地图上传到计算机,计算机再将优化后的地图分发给其他机器人,确保无漂移的重定位。在接下来的部分中,我们将在第四节和第五节分别介绍我们的特征检测和视觉里程计(VO)流程。离线优化和重定位模块将在第六节中介绍。
IV. 特征检测
A. Motivation 动机
With advancements in deep learning technology, learning-based feature detection methods have demonstrated more stable performance in illumination-challenging environments compared to traditional methods. However, existing point-line-based VO/VIO and SLAM systems typically detect key-points and line features separately. While it is acceptable for handcrafted methods due to their efficiency, the simultaneous application of keypoint detection and line detection networks in VO/VIO or SLAM systems, especially in stereo configurations, often hinders real-time performance on resource-constrained platforms. Consequently, we aim to design an efficient unified model that can detect keypoints and line features concurrently.
随着深度学习技术的发展,基于学习的特征检测方法在光照挑战性环境中相比传统方法表现出更稳定的性能。然而,现有的基于点线特征的VO/VIO和SLAM系统通常分别检测关键点和线段特征。对于手工方法来说,由于其效率,这是可以接受的,但在VO/VIO或SLAM系统中同时应用关键点检测和线段检测网络,尤其是在立体配置中,往往会在资源受限的平台上阻碍实时性能。因此,我们旨在设计一个能够同时检测关键点和线段特征的高效统一模型。
However, achieving a unified model for keypoint and line detection is challenging, as these tasks typically require different real-image datasets and training procedures. Keypoint detection models are generally trained on large datasets comprising diverse images and depend on either a boosting step or the correspondences of image pairs for training [23], [37], [38]. For line detection, we find wireframe parsing methods [47], [49] can provide stronger geometric cues than the self-supervised models [48], [65] as they are able to detect longer and more complete lines, however, these methods are trained on the Wireframe dataset [66], which is limited in size with only 5,462 discontinuous images. In the following sections, we will address this challenge and demonstrate how to train a unified model capable of performing both tasks. It is important to note that in this paper, the term "line detection" refers specifically to the wireframe parsing task.
然而,实现一个统一的模型来进行关键点和线条检测是具有挑战性的,因为这些任务通常需要不同的真实图像数据集和训练过程。
- 关键点检测模型通常在大规模包含多样化图像的数据集上进行训练,并依赖于提升步骤或图像对之间的对应关系进行训练[23],[37],[38]。
- 对于线条检测,我们发现线框解析方法[47],[49]比自监督模型[48],[65]能提供更强的几何线索,因为它们能够检测到更长和更完整的线条,然而,这些方法是在Wireframe数据集[66]上训练的,该数据集规模有限,仅有5,462张不连续的图像。
在接下来的章节中,我们将解决这一挑战,并展示如何训练一个能够执行这两项任务的统一模型。需要注意的是,在本文中,“线条检测”一词特指线框解析任务。
B. Architecture Design 架构设计
As shown in Fig. 2, we have two findings when visualizing the results of the keypoint and line detection networks: (1) Most junctions (endpoints of lines) detected by the line detection model are also selected as keypoints by the keypoint detection model. (2) The feature maps outputted by the keypoint detection model contain the edge information. Therefore, we argue that a line detection model can be built on the backbone of a pre-trained keypoint detection model. Based on this assumption, we design the PLNet to detect keypoints and lines in a unified framework. As shown in Fig. 3, it consists of the shared backbone, the keypoint module, and the line module.
如图2所示,在可视化关键点和线条检测网络的结果时,我们有两个发现:
(1) 线条检测模型检测到的大多数连接点(线条的端点)也被关键点检测模型选为关键点。
(2) 关键点检测模型输出的特征图包含了边缘信息。
因此,我们认为可以在预训练的关键点检测模型的基础上构建线条检测模型。基于这一假设,我们设计了PLNet,在一个统一的框架中检测关键点和线条。如图3所示,它包括共享的主干网络、关键点模块和线条模块。

Fig. 3. The framework of the proposed PLNet. It consists of the shared backbone, the keypoint module, and the line module.
图3. 提出的PLNet框架。它包括共享主干、关键点模块和线段模块。
Backbone: We follow SuperPoint [23] to design the backbone for its good efficiency and effectiveness. It uses 8 convolutional layers and 3 max-pooling layers. The input is the grayscale image sized \(H \times W\) . The outputs are \(H \times W \times {64}\) , \(\frac{H}{2} \times \frac{W}{2} \times {64},\frac{H}{4} \times \frac{W}{4} \times {128},\frac{H}{8} \times \frac{W}{8} \times {128}\) feature maps.
骨干网络:我们遵循 SuperPoint [23] 设计骨干网络,因其良好的效率和有效性。它使用 8 个卷积层和 3 个最大池化层。输入是尺寸为 \(H \times W\) 的灰度图像。输出是 \(H \times W \times {64}\)、\(\frac{H}{2} \times \frac{W}{2} \times {64},\frac{H}{4} \times \frac{W}{4} \times {128},\frac{H}{8} \times \frac{W}{8} \times {128}\) 特征图。
Keypoint Module: We also follow SuperPoint [23] to design the keypoint detection header. It has two branches: the score branch and the descriptor branch. The inputs are \(\frac{H}{8} \times \frac{W}{8} \times {128}\) feature maps outputted by the backbone. The score branch outputs a tensor sized \(\frac{H}{8} \times \frac{W}{8} \times {65}\) . The 65 channels correspond to an \(8 \times 8\) grid region and a dustbin indicating no keypoint. The tensor is processed by a softmax and then resized to \(H \times W\) . The descriptor branch outputs a tensor sized \(\frac{H}{8} \times \frac{W}{8} \times {256}\) ,which is used for interpolation to compute descriptors of keypoints.
关键点模块:我们也遵循 SuperPoint [23] 设计关键点检测头部。它有两个分支:分数分支和描述子分支。输入是骨干网络输出的 \(\frac{H}{8} \times \frac{W}{8} \times {128}\) 特征图。分数分支输出一个尺寸为 \(\frac{H}{8} \times \frac{W}{8} \times {65}\) 的张量。65 个通道对应于一个 \(8 \times 8\) 网格区域和一个表示无关键点的垃圾箱。该张量经过 softmax 处理后调整大小为 \(H \times W\)。描述子分支输出一个尺寸为 \(\frac{H}{8} \times \frac{W}{8} \times {256}\) 的张量,用于插值计算关键点的描述子。
Line Module: This module takes \(\frac{H}{4} \times \frac{W}{4} \times {128}\) feature maps as inputs. It consists of a U-Net-like CNN and the line detection header. We modify the U-Net [67] to make it contain fewer convolutional layers and thus be more efficient. The U-Net-like CNN is to increase the receptive field as detecting lines requires a larger receptive field than detecting keypoints. The EPD LOIAlign [49] is used to process the outputs of the line module and finally outputs junctions and lines.
线段模块:该模块以 \(\frac{H}{4} \times \frac{W}{4} \times {128}\) 特征图作为输入。它由一个类似 U-Net 的 CNN 和线段检测头部组成。我们修改了 U-Net [67],使其包含较少的卷积层,从而提高效率。类似 U-Net 的 CNN 旨在增加感受野,因为检测线段比检测关键点需要更大的感受野。EPD LOIAlign [49] 用于处理线段模块的输出,最终输出节点和线段。
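为便于理解上述各模块之间的张量尺寸关系，下面给出一个极简的 PyTorch 草图（仅为示意，并非论文发布的实现；关键点头的具体层数、dustbin 通道位置等细节均为本示例的假设）：

```python
# 示意性草图：按文中描述的尺寸搭建 PLNet 的共享骨干与关键点头，线模块仅标注其输入。
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_relu(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))

class PLNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # 共享骨干：8 个卷积层 + 3 个最大池化层（SuperPoint 风格）
        self.b1 = nn.Sequential(conv_relu(1, 64), conv_relu(64, 64))        # H x W x 64
        self.b2 = nn.Sequential(conv_relu(64, 64), conv_relu(64, 64))       # H/2 x W/2 x 64
        self.b3 = nn.Sequential(conv_relu(64, 128), conv_relu(128, 128))    # H/4 x W/4 x 128
        self.b4 = nn.Sequential(conv_relu(128, 128), conv_relu(128, 128))   # H/8 x W/8 x 128
        self.pool = nn.MaxPool2d(2, 2)
        # 关键点头（此处简化为 1x1 卷积）：得分分支 65 通道，描述子分支 256 通道
        self.score = nn.Conv2d(128, 65, 1)
        self.desc = nn.Conv2d(128, 256, 1)

    def forward(self, gray):                       # gray: B x 1 x H x W 灰度图
        f1 = self.b1(gray)
        f2 = self.b2(self.pool(f1))
        f3 = self.b3(self.pool(f2))                # H/4 尺度特征，作为线模块的输入
        f4 = self.b4(self.pool(f3))
        # 得分分支：softmax 后去掉 dustbin 通道（假设其为最后一个通道），再把 8x8 网格还原为全分辨率热图
        s = F.softmax(self.score(f4), dim=1)[:, :64]
        heat = F.pixel_shuffle(s, 8)               # B x 1 x H x W
        desc = F.normalize(self.desc(f4), dim=1)   # B x 256 x H/8 x W/8，用于插值计算关键点描述子
        return f3, heat, desc

x = torch.zeros(1, 1, 480, 640)
f3, heat, desc = PLNetSketch()(x)
print(f3.shape, heat.shape, desc.shape)  # (1,128,120,160) (1,1,480,640) (1,256,60,80)
```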
C. 网络训练
Due to the training problem described in Section IV-A and the assumption in Section IV-B, we train our PLNet in two rounds. In the first round, only the backbone and the keypoint detection module are trained, which means we need to train a keypoint detection network. In the second round, the backbone and the keypoint detection module are fixed, and we only train the line detection module on the Wireframe dataset. We skip the details of the first round as they are very similar to [23]. Instead, we present the training of the line detection module.
由于第四节A部分描述的训练问题和第四节B部分的假设,我们分两轮训练我们的PLNet。
- 在第一轮中,仅训练主干网络和关键点检测模块,这意味着我们需要训练一个关键点检测网络。
- 在第二轮中,主干网络和关键点检测模块保持固定,我们仅在Wireframe数据集上训练线条检测模块。
我们跳过第一轮的细节,因为它们与[23]非常相似。相反,我们介绍线条检测模块的训练。
Line Encoding: We adopt the attraction region field [49] to encode line segments. For a line segment \(\mathbf{l} = \left( {{\mathbf{x}}_{\mathbf{1}},{\mathbf{x}}_{\mathbf{2}}}\right)\) ,where \({\mathbf{x}}_{\mathbf{1}}\) and \({\mathbf{x}}_{\mathbf{2}}\) are two endpoints of \(\mathbf{l}\) ,and a point \(\mathbf{p}\) in the attraction region of \(\mathbf{l}\) ,four parameters and \(\mathbf{p}\) are used to encode \(\mathbf{l}\) :
线条编码:我们采用吸引区域场[49]来编码线段。对于一条线段 \(\mathbf{l} = \left( {{\mathbf{x}}_{\mathbf{1}},{\mathbf{x}}_{\mathbf{2}}}\right)\),其中 \({\mathbf{x}}_{\mathbf{1}}\) 和 \({\mathbf{x}}_{\mathbf{2}}\) 是 \(\mathbf{l}\) 的两个端点,以及在 \(\mathbf{l}\) 的吸引区域内的一个点 \(\mathbf{p}\),使用四个参数和 \(\mathbf{p}\) 来编码 \(\mathbf{l}\):
where \(d = \left| \mathbf{{po}}\right|\) and \(\mathbf{o}\) is the foot of the perpendicular. \(\theta\) is the angle between \(\mathbf{l}\) and the \(\mathbf{Y}\)-axis of the image. \({\theta }_{1}\) is the angle between \({\mathbf{{px}}}_{\mathbf{1}}\) and \(\mathbf{{po}}\). \({\theta }_{2}\) is the angle between \({\mathbf{{px}}}_{\mathbf{2}}\) and \(\mathbf{{po}}\). The network can predict these four parameters for point \(\mathbf{p}\) and then \(\mathbf{l}\) can be decoded through:
其中 \(d = \left| \mathbf{{po}}\right|\)，\(\mathbf{o}\) 是垂足。\(\theta\) 是 \(\mathbf{l}\) 与图像 \(\mathbf{Y}\) 轴之间的夹角。\({\theta }_{1}\) 是 \({\mathbf{{px}}}_{\mathbf{1}}\) 与 \(\mathbf{{po}}\) 之间的夹角，\({\theta }_{2}\) 是 \({\mathbf{{px}}}_{\mathbf{2}}\) 与 \(\mathbf{{po}}\) 之间的夹角。网络可以预测点 \(\mathbf{p}\) 的这四个参数，然后可以通过以下方式解码 \(\mathbf{l}\)：
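下面按上文对 \(d\)、\(\theta\)、\({\theta }_{1}\)、\({\theta }_{2}\) 的几何定义给出一个解码线段端点的示意草图；其中法向量的取向以及两个端点分别位于垂足两侧等符号约定是本示例的假设，并非原文式 (2) 的精确形式：

```python
# 示意性草图：由吸引区域场参数 (d, theta, theta1, theta2) 与像素 p 解码线段端点
import numpy as np

def decode_line(p, d, theta, theta1, theta2):
    """p: 像素坐标 (x, y)；d = |po|；theta 为线段与图像 Y 轴的夹角；
    theta1/theta2 分别为 px1、px2 与 po 的夹角。符号约定为假设。"""
    u = np.array([np.sin(theta), np.cos(theta)])    # 线段方向（与 Y 轴夹角为 theta）
    n = np.array([np.cos(theta), -np.sin(theta)])   # 取一侧法向（符号约定为假设）
    o = np.asarray(p, float) + d * n                # 垂足 o
    x1 = o + d * np.tan(theta1) * u                 # 端点 1：|ox1| = d * tan(theta1)
    x2 = o - d * np.tan(theta2) * u                 # 端点 2：假设位于垂足另一侧
    return x1, x2

x1, x2 = decode_line(p=(100, 120), d=5.0, theta=0.3, theta1=0.8, theta2=0.6)
print(x1, x2)
```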
Line Prediction: The line detection module outputs a tensor sized \(\frac{H}{4} \times \frac{W}{4} \times 4\) to predict parameters in (1) and a heatmap to predict junctions. For each decoded line segment by (2), two junctions closest to its endpoints will be selected to form a line proposal with it. Proposals with the same junctions will be deduplicated and only one is retained. Then the EPD LOIAlign [49] and a head classifier are applied to decide whether the line proposal is a true line feature.
线段预测:线段检测模块输出一个大小为 \(\frac{H}{4} \times \frac{W}{4} \times 4\) 的张量来预测公式 (1) 中的参数,并输出一个热图来预测节点。对于通过公式 (2) 解码的每个线段,将选择其端点附近最近的两个节点与其形成一个线段提议。具有相同节点的提议将被去重,仅保留一个。然后应用 EPD LOIAlign [49] 和头部分类器来决定该线段提议是否为真正的线段特征。
Line Module Training: We use the \({L1}\) loss to supervise the prediction of parameters in (1) and the binary cross-entropy loss to supervise the junction heatmap and the head classifier. The total loss is the sum of them. As shown in Fig. 4, to improve the robustness of line detection in illumination-challenging environments, seven types of photometric data augmentation are applied to process training images. The training uses the ADAM optimizer [68] with the learning rate \(lr = 4 \times {10}^{-4}\) in the first 35 epochs and \(lr = 4 \times {10}^{-5}\) in the last 5 epochs.
线段模块训练：我们使用 \({L1}\) 损失来监督公式 (1) 中参数的预测，使用二元交叉熵损失来监督节点热图和头部分类器。总损失是它们的和。如图 4 所示，为了提高在光照挑战环境下的线段检测鲁棒性，应用了七种类型的光度数据增强来处理训练图像。训练使用 ADAM 优化器 [68]，在前 35 个周期中学习率为 \(lr = 4 \times {10}^{-4}\)，在后 5 个周期中学习率为 \(lr = 4 \times {10}^{-5}\)。

Fig. 4. We use seven types of photometric data augmentation to train our PLNet to make it more robust to challenging illumination.
图4. 我们使用了七种类型的光度数据增强来训练我们的PLNet,使其对具有挑战性的光照条件更加鲁棒。
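作为参考，下面给出几种常见光度增强（亮度、对比度、伽马、高斯噪声）的简化实现；原文具体使用的七种增强类型以图 4 为准，此处仅作示意：

```python
# 示意性草图：对灰度图随机施加几种常见的光度扰动（并非与原文七种增强一一对应）
import numpy as np

def photometric_augment(img, rng=np.random.default_rng()):
    """img: HxW 的 uint8 灰度图，取值范围 [0, 255]。"""
    out = img.astype(np.float32) / 255.0
    out = out * rng.uniform(0.5, 1.5)                      # 随机亮度缩放
    out = (out - 0.5) * rng.uniform(0.7, 1.3) + 0.5        # 随机对比度
    out = np.clip(out, 0.0, 1.0) ** rng.uniform(0.7, 1.5)  # 随机伽马
    out = out + rng.normal(0.0, 0.02, size=out.shape)      # 加性高斯噪声
    return (np.clip(out, 0.0, 1.0) * 255).astype(np.uint8)

aug = photometric_augment(np.full((480, 640), 128, dtype=np.uint8))
```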
V. Stereo Visual Odometry 立体视觉里程计
A. 概述

Fig. 5. The framework of our visual(-inertial) odometry. The system is split into two main threads, which are represented by two different colored regions. Note that the IMU input is not strictly required. The system is optional to use stereo data or stereo-inertial data.
图5. 我们的视觉(-惯性)里程计框架。系统分为两个主要线程,分别用两种不同颜色的区域表示。请注意,IMU输入并不是严格必需的。系统可以选择使用立体数据或立体-惯性数据。
The proposed point-line-based stereo visual odometry is shown in Fig. 5. It is a hybrid VO system utilizing both the learning-based front-end and the traditional optimization backend. For each stereo image pair, we first employ the proposed PLNet to extract keypoints and line features. Then a GNN (LightGlue [26]) is used to match keypoints. In parallel, we associate line features with keypoints and match them using the keypoint matching results. After that, we perform an initial pose estimation and reject outliers. Based on the results, we triangulate the \(2\mathrm{D}\) features of keyframes and insert them into the map. Finally, the local bundle adjustment will be performed to optimize points, lines, and keyframe poses. In the meantime, if an IMU is accessible, its measurements will be processed using the IMU preintegration method [69], and added to the initial pose estimation and local bundle adjustment.
提出的基于点线的立体视觉里程计如图5所示。这是一个混合VO系统,结合了基于学习的前端和传统的优化后端。
- 对于每一对立体图像,我们首先使用提出的PLNet来提取关键点和线特征。
- 然后使用GNN(LightGlue [26])来匹配关键点。
- 同时,我们将线特征与关键点关联,并利用关键点匹配结果进行匹配。
- 之后,我们进行初始姿态估计并剔除外点。
基于这些结果,我们对关键帧的\(2\mathrm{D}\)特征进行三角测量并将其插入地图中。最后,将执行局部束调整以优化点、线和关键帧姿态。在此期间,如果可以访问IMU,其测量值将使用IMU预积分方法[69]进行处理,并添加到初始姿态估计和局部束调整中。
Applying both learning-based feature detection and matching methods to the stereo VO is time-consuming. Therefore, to improve efficiency, the following three techniques are utilized in our system. (1) For keyframes, we extract features on both left and right images and perform stereo matching to estimate the real scale. But for non-keyframes, we only process the left image. Besides, we use some lenient criteria to make the selected keyframes in our system very sparse, so the runtime and resource consumption of feature detection and matching in our system are close to that of a monocular system. (2) We convert the inference code of the CNN and GNN from Python to C++, and deploy them using ONNX and NVIDIA TensorRT, where the 16-bit floating-point arithmetic replaces the 32-bit floating-point arithmetic. (3) We design a multi-thread pipeline. A producer-consumer model is used to split the system into two main threads, i.e., the front-end thread and the backend thread. The front-end thread extracts and matches features while the backend thread performs the initial pose estimation, keyframe insertion, and local bundle adjustment.
将基于学习的特征检测和匹配方法应用于立体视觉里程计(VO)是耗时的。因此,为了提高效率,我们的系统采用了以下三种技术。
- (1)对于关键帧,我们在左右图像上提取特征并进行立体匹配以估计真实尺度。但对于非关键帧,我们只处理左图像。此外,我们使用一些宽松的标准使系统中选定的关键帧非常稀疏,因此特征检测和匹配的运行时间和资源消耗接近单目系统。
- (2)我们将CNN和GNN的推理代码从Python转换为C++,并使用ONNX和NVIDIA TensorRT进行部署,其中16位浮点运算取代了32位浮点运算。
- (3)我们设计了一个多线程流水线。采用生产者-消费者模型将系统分为两个主要线程,即前端线程和后端线程。前端线程提取和匹配特征,而后端线程执行初始姿态估计、关键帧插入和局部束调整。
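针对上述第 (3) 点的生产者-消费者设计，下面给出一个用 Python 线程和有界队列表示的简化示意；其中 detect_and_match 与 estimate_pose_and_map 为占位的假设接口，仅用于说明前端与后端线程的分工方式：

```python
# 示意性草图：前端（特征检测与匹配）与后端（位姿估计与局部 BA）的生产者-消费者流水线
import queue
import threading

frame_queue = queue.Queue(maxsize=4)    # 前端 -> 后端 的有界缓冲

def detect_and_match(frame):            # 占位接口：特征检测与匹配（前端）
    return {"frame": frame, "matches": []}

def estimate_pose_and_map(packet):      # 占位接口：初始位姿估计、关键帧插入、局部 BA（后端）
    pass

def frontend(frames):                   # 生产者线程
    for f in frames:
        frame_queue.put(detect_and_match(f))
    frame_queue.put(None)               # 结束标志

def backend():                          # 消费者线程
    while (packet := frame_queue.get()) is not None:
        estimate_pose_and_map(packet)

t1 = threading.Thread(target=frontend, args=(range(10),))
t2 = threading.Thread(target=backend)
t1.start(); t2.start(); t1.join(); t2.join()
```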
B. 特征匹配
We use LightGlue [26] to match keypoints. For line features, most of the current VO and SLAM systems use the LBD algorithm [70] or tracking sample points to match them. However, the LBD algorithm extracts the descriptor from a local band region of the line, so it suffers from unstable line detection due to challenging illumination or viewpoint changes. Tracking sample points can match the line detected with different lengths in two frames, but current SLAM systems usually use optical flow to track the sample points, which have a bad performance when the light conditions change rapidly or violently. Some learning-based line feature descriptors [48] are also proposed, however, they are rarely used in current SLAM systems due to the increased time complexity.
我们使用LightGlue [26]来匹配关键点。对于线特征,当前大多数VO和SLAM系统使用LBD算法 [70] 或跟踪采样点来匹配它们。然而,LBD算法从线的局部带区域提取描述符,因此由于光照或视角变化,它受到不稳定的线检测影响。跟踪采样点可以在两帧中匹配不同长度的检测线,但当前SLAM系统通常使用光流来跟踪采样点,这在光照条件快速或剧烈变化时性能不佳。一些基于学习的线特征描述符 [48] 也被提出,但由于增加了时间复杂度,它们在当前SLAM系统中很少使用。
Therefore, to address both the effectiveness problem and efficiency problem, we design a fast and robust line-matching method for illumination-challenging conditions. First, we associate keypoints with line segments through their distances. Assume that \(M\) keypoints and \(N\) line segments are detected on the image, where each keypoint is denoted as \({\mathbf{p}}_{i} = \left( {{x}_{i},{y}_{i}}\right)\) and each line segment is denoted as \({\mathbf{l}}_{j} =\) \(\left( {{A}_{j},{B}_{j},{C}_{j},{x}_{j,1},{y}_{j,1},{x}_{j,2},{y}_{j,2}}\right)\) ,where \(\left( {{A}_{j},{B}_{j},{C}_{j}}\right)\) are line parameters of \({\mathbf{l}}_{j}\) and \(\left( {{x}_{j,1},{y}_{j,1},{x}_{j,2},{y}_{j,2}}\right)\) are the endpoints.
因此,为了解决有效性和效率问题,我们设计了一种快速且鲁棒的线匹配方法,适用于光照挑战性条件。首先,我们通过距离将关键点与线段关联起来。假设在图像上检测到 \(M\) 个关键点和 \(N\) 条线段,其中每个关键点表示为 \({\mathbf{p}}_{i} = \left( {{x}_{i},{y}_{i}}\right)\),每条线段表示为 \({\mathbf{l}}_{j} =\) \(\left( {{A}_{j},{B}_{j},{C}_{j},{x}_{j,1},{y}_{j,1},{x}_{j,2},{y}_{j,2}}\right)\),其中 \(\left( {{A}_{j},{B}_{j},{C}_{j}}\right)\) 是 \({\mathbf{l}}_{j}\) 的线段参数,\(\left( {{x}_{j,1},{y}_{j,1},{x}_{j,2},{y}_{j,2}}\right)\) 是端点。
We first compute the distance between \({\mathbf{p}}_{i}\) and \({\mathbf{l}}_{j}\) through:
我们首先计算 \({\mathbf{p}}_{i}\) 和 \({\mathbf{l}}_{j}\) 之间的距离:
If \({d}_{ij} < 3\) and the projection of \({\mathbf{p}}_{i}\) on the coordinate axis lies within the projections of the line segment endpoints, i.e., \(\min \left( {{x}_{j,1},{x}_{j,2}}\right) \leq {x}_{i} \leq \max \left( {{x}_{j,1},{x}_{j,2}}\right)\) or \(\min \left( {{y}_{j,1},{y}_{j,2}}\right) \leq {y}_{i} \leq \max \left( {{y}_{j,1},{y}_{j,2}}\right)\), we will say \({\mathbf{p}}_{i}\) belongs to \({\mathbf{l}}_{j}\). Then the line segments on two images can be matched based on the point-matching result of these two images. For \({\mathbf{l}}_{k,m}\) on image \(k\) and \({\mathbf{l}}_{k + 1,n}\) on image \(k + 1\), we compute a score \({S}_{mn}\) to represent the confidence that they are the same line:
如果 \({d}_{ij} < 3\) 且 \({\mathbf{p}}_{i}\) 在坐标轴上的投影位于线段端点的投影范围内，即 \(\min \left( {{x}_{j,1},{x}_{j,2}}\right) \leq {x}_{i} \leq \max \left( {{x}_{j,1},{x}_{j,2}}\right)\) 或 \(\min \left( {{y}_{j,1},{y}_{j,2}}\right) \leq {y}_{i} \leq \max \left( {{y}_{j,1},{y}_{j,2}}\right)\)，我们就认为 \({\mathbf{p}}_{i}\) 属于 \({\mathbf{l}}_{j}\)。然后，可以根据这两幅图像的点匹配结果来匹配两幅图像上的线段。对于图像 \(k\) 上的 \({\mathbf{l}}_{k,m}\) 和图像 \(k + 1\) 上的 \({\mathbf{l}}_{k + 1,n}\)，我们计算一个分数 \({S}_{mn}\) 来表示它们是同一条线的置信度：
where \({N}_{pm}\) is the number of matches between point features belonging to \({\mathbf{l}}_{k,m}\) and point features belonging to \({\mathbf{l}}_{k + 1,n}\). \({N}_{k,m}\) and \({N}_{k + 1,n}\) are the numbers of point features belonging to \({\mathbf{l}}_{k,m}\) and \({\mathbf{l}}_{k + 1,n}\), respectively. Then if \({S}_{mn} > {\delta }_{S}\) and \({N}_{pm} > {\delta }_{N}\), where \({\delta }_{S}\) and \({\delta }_{N}\) are two preset thresholds, we will regard \({\mathbf{l}}_{k,m}\) and \({\mathbf{l}}_{k + 1,n}\) as the same line. This coupled feature matching method allows our line matching to share the robust performance of keypoint matching while being highly efficient, as it does not need another line-matching network.
其中 \({N}_{pm}\) 是属于 \({\mathbf{l}}_{k,m}\) 的点特征与属于 \({\mathbf{l}}_{k + 1,n}\) 的点特征之间的匹配数量。\({N}_{k,m}\) 和 \({N}_{k + 1,n}\) 分别是属于 \({\mathbf{l}}_{k,m}\) 和 \({\mathbf{l}}_{k + 1,n}\) 的点特征数量。然后，如果 \({S}_{mn} > {\delta }_{S}\) 且 \({N}_{pm} > {\delta }_{N}\)，其中 \({\delta }_{S}\) 和 \({\delta }_{N}\) 是两个预设阈值，我们就认为 \({\mathbf{l}}_{k,m}\) 和 \({\mathbf{l}}_{k + 1,n}\) 是同一条线。这种耦合特征匹配方法使我们的线匹配能够共享关键点匹配的鲁棒性能，同时由于不需要额外的线匹配网络而具有很高的效率。
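下面给出点-线关联与基于共享关键点的线匹配的示意实现：点到线的距离采用标准形式 \(d = \left| {Ax} + {By} + C\right| /\sqrt{{A}^{2} + {B}^{2}}\)（对应文中式 (3) 的常见写法），而 \({S}_{mn}\) 的具体归一化方式在本示例中取 \({N}_{pm}/\min \left( {{N}_{k,m},{N}_{k + 1,n}}\right)\)，属于假设，应以原文公式为准：

```python
# 示意性草图：点-线关联与基于关键点匹配结果的线段匹配
import numpy as np

def point_line_distance(p, line):
    """line = (A, B, C, x1, y1, x2, y2)；返回点到直线 Ax+By+C=0 的距离。"""
    A, B, C = line[:3]
    return abs(A * p[0] + B * p[1] + C) / np.hypot(A, B)

def assign_points_to_lines(points, lines, d_th=3.0):
    """按文中规则关联：距离小于 3 像素，且点在线段端点的坐标轴投影区间内。"""
    assoc = {j: set() for j in range(len(lines))}
    for i, p in enumerate(points):
        for j, l in enumerate(lines):
            x1, y1, x2, y2 = l[3:]
            in_x = min(x1, x2) <= p[0] <= max(x1, x2)
            in_y = min(y1, y2) <= p[1] <= max(y1, y2)
            if point_line_distance(p, l) < d_th and (in_x or in_y):
                assoc[j].add(i)
    return assoc

def match_lines(assoc_k, assoc_k1, point_matches, s_th=0.8, n_th=3):
    """point_matches: {帧 k 的点索引: 帧 k+1 的点索引}。得分归一化方式为本示例假设。"""
    matches = []
    for m, pts_m in assoc_k.items():
        for n, pts_n in assoc_k1.items():
            if min(len(pts_m), len(pts_n)) == 0:
                continue
            n_pm = sum(1 for i in pts_m if point_matches.get(i) in pts_n)
            score = n_pm / min(len(pts_m), len(pts_n))
            if score > s_th and n_pm > n_th:
                matches.append((m, n, score))
    return matches
```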
C. 三维特征处理
In this part, we will introduce our 3D feature processing methods, including 3D feature representation, triangulation, i.e.,constructing \(3\mathrm{D}\) features from \(2\mathrm{D}\) features,and re-projection, i.e., projecting 3D features to the image plane. We skip the details of 3D point processing in our system as they are easy to do and similar to other point-based VO and SLAM systems. On the contrary, compared with 3D points, 3D lines have more degrees of freedom, and they are easier to degenerate when being triangulated. Therefore, the 3D line processing will be illustrated in detail.
在本部分中,我们将介绍我们的三维特征处理方法,包括三维特征表示、三角测量,即从 \(2\mathrm{D}\) 特征构建 \(3\mathrm{D}\) 特征,以及重投影,即将三维特征投影到图像平面上。我们跳过了系统中三维点处理的细节,因为它们易于实现且与其他基于点的视觉里程计和SLAM系统类似。相反,与三维点相比,三维线具有更多的自由度,并且在进行三角测量时更容易退化。因此,三维线处理将详细说明。
- 3D Line Representation: We use Plücker coordinates [71] to represent a 3D spatial line:
- 三维线表示：我们使用普吕克坐标 [71] 来表示三维空间中的线：
where \(\mathbf{v}\) is the direction vector of the line and \(\mathbf{n}\) is the normal vector of the plane determined by the line and the origin. Plücker coordinates are used for 3D line triangulation, transformation, and projection. It is over-parameterized because it is a 6-dimensional vector, but a 3D line has only four degrees of freedom. In the graph optimization stage, the extra degrees of freedom will increase the computational cost and cause the numerical instability of the system [72]. Therefore, we also use orthonormal representation [71] to represent a 3D line:
其中 \(\mathbf{v}\) 是线的方向向量,\(\mathbf{n}\) 是由线和原点确定的平面的法向量。普吕克坐标用于三维线的三角测量、变换和投影。它是过度参数化的,因为它是一个六维向量,而三维线只有四个自由度。在图优化阶段,额外的自由度会增加计算成本并导致系统数值不稳定 [72]。因此,我们还使用正交表示 [71] 来表示三维线:
The relationship between Plücker coordinates and orthonormal representation is similar to \({SO}\left( 3\right)\) and \({so}\left( 3\right)\) . Orthonormal representation can be obtained from Plücker coordinates by:
普吕克坐标与正交表示之间的关系类似于 \({SO}\left( 3\right)\) 和 \({so}\left( 3\right)\)。正交表示可以通过以下方式从普吕克坐标获得:
where \({\sum }_{3 \times 2}\) is a diagonal matrix and its two non-zero entries defined up to scale can be represented by an \({SO}\left( 2\right)\) matrix:
其中 \({\sum }_{3 \times 2}\) 是一个对角矩阵,其两个非零条目可以根据比例定义为一个 \({SO}\left( 2\right)\) 矩阵:
In practice, this conversion can be done simply and quickly with the QR decomposition.
在实践中,这种转换可以通过 QR 分解简单快速地完成。
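作为补充，下面用 NumPy 给出通过 QR 分解把 Plücker 坐标 \(\left( {\mathbf{n},\mathbf{v}}\right)\) 转换为正交表示 \(\left( {\mathbf{U},\mathbf{W}}\right)\) 的简化草图（假设 \(\mathbf{n} \bot \mathbf{v}\)；列向量的符号约定从简，仅作示意）：

```python
# 示意性草图：Plücker 坐标 -> 正交表示 (U, W)
import numpy as np

def plucker_to_orthonormal(n, v):
    """输入 Plücker 坐标 L = (n, v)（满足 n·v = 0）；输出 U ∈ SO(3) 与 W ∈ SO(2)。"""
    C = np.stack([n, v], axis=1)             # 3x2 矩阵 [n | v]
    Q, R = np.linalg.qr(C, mode="complete")  # QR 分解：Q 为 3x3 正交阵
    if np.linalg.det(Q) < 0:                 # 保证 U ∈ SO(3)（符号细节此处从简）
        Q[:, 2] *= -1
    U = Q
    w = np.array([np.linalg.norm(n), np.linalg.norm(v)])
    w = w / np.linalg.norm(w)                # (w1, w2) 仅定义到尺度，取单位化
    W = np.array([[w[0], -w[1]], [w[1], w[0]]])  # SO(2) 矩阵
    return U, W

n = np.array([0.0, 0.0, 2.0]); v = np.array([1.0, 0.0, 0.0])  # 一条示例 3D 直线
U, W = plucker_to_orthonormal(n, v)
```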
- Triangulation: Triangulation is to initialize a 3D line from two or more 2D line features. In our system, we use two methods to triangulate a 3D line. The first is similar to the line triangulation algorithm \(B\) in [73], where the pose of a 3D line can be computed from two planes. To achieve this, we select two line segments, \({\mathbf{l}}_{1}\) and \({\mathbf{l}}_{2}\), on two images, which are two observations of a 3D line. Note that the two images can come from the stereo pair of the same keyframe or two different keyframes. \({\mathbf{l}}_{1}\) and \({\mathbf{l}}_{2}\) can be back-projected and construct two 3D planes, \({\pi }_{1}\) and \({\pi }_{2}\). Then the 3D line can be regarded as the intersection of \({\pi }_{1}\) and \({\pi }_{2}\).
- 三角测量：三角测量是从两个或多个二维线特征初始化一条三维线。在我们的系统中，我们使用两种方法来三角测量一条三维线。第一种方法类似于[73]中的线三角测量算法\(B\)，其中三维线的姿态可以从两个平面计算得出。为此，我们在两幅图像上选择两条线段，\({\mathbf{l}}_{1}\)和\({\mathbf{l}}_{2}\)，它们是同一条三维线的两个观测值。请注意，这两幅图像可以来自同一关键帧的立体对或两个不同的关键帧。\({\mathbf{l}}_{1}\)和\({\mathbf{l}}_{2}\)可以反投影并构造两个三维平面，\({\pi }_{1}\)和\({\pi }_{2}\)。然后，三维线可以视为\({\pi }_{1}\)和\({\pi }_{2}\)的交线。
However, triangulating a 3D line is more difficult than triangulating a 3D point, because it suffers more from degenerate motions [73]. Therefore, we also employ a second line triangulation method if the above method fails, where points are utilized to compute the 3D line. In Section V-B, we have associated point features with line features. So to initialize a 3D line,two triangulated points \({\mathbf{X}}_{1}\) and \({\mathbf{X}}_{2}\) ,which belong to this line and have the shortest distance from this line on the image plane are selected. Then the Plücker coordinates of this line can be obtained through:
然而,三角测量一条三维线比三角测量一个三维点更为困难,因为它更容易受到退化运动的影响[73]。因此,如果上述方法失败,我们还采用第二种线三角测量方法,其中利用点来计算三维线。在第V-B节中,我们已经将点特征与线特征关联起来。因此,为了初始化一条三维线,我们选择两个属于该线并在图像平面上距离该线最短的三角测量点\({\mathbf{X}}_{1}\)和\({\mathbf{X}}_{2}\)。然后,可以通过以下公式获得该线的普吕克坐标:
This method requires little extra computation because the selected 3D points have been triangulated in the point triangulating stage. It is very efficient and robust.
这种方法所需的额外计算量很小,因为所选的三维点已经在点三角测量阶段被三角测量。它非常高效且稳健。
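下面是利用两个已三角化的 3D 点构造 Plücker 坐标的示意代码，采用由两点构造 Plücker 直线的标准形式 \(\mathbf{v} = {\mathbf{X}}_{2} - {\mathbf{X}}_{1}\)、\(\mathbf{n} = {\mathbf{X}}_{1} \times {\mathbf{X}}_{2}\)；是否归一化等细节为本示例的假设：

```python
# 示意性草图：由直线上两个 3D 点构造 Plücker 坐标 L = (n, v)
import numpy as np

def plucker_from_points(X1, X2):
    v = X2 - X1                    # 方向向量
    n = np.cross(X1, X2)           # 直线与原点所确定平面的法向量
    scale = np.linalg.norm(v)
    return n / scale, v / scale    # 归一化方式为假设：使 |v| = 1

n, v = plucker_from_points(np.array([1.0, 0.0, 2.0]), np.array([1.0, 1.0, 2.0]))
print(n, v)   # n 与 v 满足 n·v = 0
```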
- Re-projection: Re-projection is used to compute the re-projection errors. We use Plücker coordinates to transform and re-project 3D lines. First, we convert the 3D line from the world frame to the camera frame:
- 重投影：重投影用于计算重投影误差。我们使用普吕克坐标来转换和重投影三维直线。首先，我们将三维直线从世界坐标系转换到相机坐标系：
where \({\mathbf{L}}_{c}\) and \({\mathbf{L}}_{w}\) are Plücker coordinates of \(3\mathrm{D}\) line in the camera frame and world frame,respectively. \({\mathbf{R}}_{cw} \in {SO}\left( 3\right)\) is the rotation matrix from world frame to camera frame and \({\mathbf{t}}_{cw} \in {\mathbb{R}}^{3}\) is the translation vector. \({\left\lbrack \cdot \right\rbrack }_{ \times }\) denotes the skew-symmetric matrix of a vector and \({\mathbf{H}}_{cw}\) is the transformation matrix of 3D lines from world frame to camera frame.
其中 \({\mathbf{L}}_{c}\) 和 \({\mathbf{L}}_{w}\) 分别是相机坐标系和世界坐标系中 \(3\mathrm{D}\) 直线的普吕克坐标。\({\mathbf{R}}_{cw} \in {SO}\left( 3\right)\) 是世界坐标系到相机坐标系的旋转矩阵,\({\mathbf{t}}_{cw} \in {\mathbb{R}}^{3}\) 是平移向量。\({\left\lbrack \cdot \right\rbrack }_{ \times }\) 表示向量的斜对称矩阵,\({\mathbf{H}}_{cw}\) 是世界坐标系到相机坐标系的三维直线变换矩阵。
Then the \(3\mathrm{D}\) line \({\mathbf{L}}_{c}\) can be projected to the image plane through a line projection matrix \({\mathbf{P}}_{c}\) :
然后,\(3\mathrm{D}\) 直线 \({\mathbf{L}}_{c}\) 可以通过线投影矩阵 \({\mathbf{P}}_{c}\) 投影到图像平面上:
where \(\mathbf{l} = {\left\lbrack \begin{array}{lll} A & B & C \end{array}\right\rbrack }^{\top }\) is the re-projected \(2\mathrm{D}\) line on the image plane. \({\mathbf{L}}_{c\left\lbrack { : 3}\right\rbrack }\) denotes the first three rows of vector \({\mathbf{L}}_{c}\).
其中 \(\mathbf{l} = {\left\lbrack \begin{array}{lll} A & B & C \end{array}\right\rbrack }^{\top }\) 是图像平面上的重投影 \(2\mathrm{D}\) 直线。\({\mathbf{L}}_{c\left\lbrack { : 3}\right\rbrack }\) 表示向量 \({\mathbf{L}}_{c}\) 的前三行。
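下面给出 3D 直线变换与投影的示意实现；其中 \({\mathbf{H}}_{cw}\) 与线投影矩阵 \({\mathbf{P}}_{c}\) 采用点线 SLAM 中常见的标准形式（具体以原文公式为准），fx、fy、cx、cy 为针孔相机内参：

```python
# 示意性草图：Plücker 直线的坐标系变换与图像平面投影
import numpy as np

def skew(t):
    return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])

def transform_line(L_w, R_cw, t_cw):
    """L_w = (n_w, v_w) 为 6 维向量；按 H_cw = [[R, [t]x R], [0, R]] 变换到相机系（标准形式）。"""
    H = np.block([[R_cw, skew(t_cw) @ R_cw],
                  [np.zeros((3, 3)), R_cw]])
    return H @ L_w

def project_line(L_c, fx, fy, cx, cy):
    """线投影矩阵作用于 L_c 的前三行（法向量部分），得到图像直线 l = (A, B, C)；
    P_c 的取法为点线 SLAM 中常见的 K^{-T} 形式，具体以原文公式为准。"""
    P = np.array([[fy, 0.0, 0.0],
                  [0.0, fx, 0.0],
                  [-fy * cx, -fx * cy, fx * fy]])
    return P @ L_c[:3]

L_w = np.array([-2.0, 0.0, 1.0, 0.0, 1.0, 0.0])   # 上例中过 (1,0,2)、(1,1,2) 的直线
l = project_line(transform_line(L_w, np.eye(3), np.zeros(3)), 500, 500, 320, 240)
print(l)   # 对应图像上 u = 570 的竖直线
```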
D. 关键帧选择
Observing that the learning-based data association method used in our system is able to track two frames that have a large baseline, so different from the frame-by-frame tracking strategy used in other VO or SLAM systems, we only match the current frame with the last keyframe. We argue this strategy can reduce the accumulated tracking error.
观察到我们系统中使用的基于学习的数据关联方法能够跟踪具有较大基线的两帧,因此与其它VO或SLAM系统中使用的逐帧跟踪策略不同,我们仅将当前帧与最后一个关键帧进行匹配。我们认为这种策略可以减少累积的跟踪误差。
Therefore, the keyframe selection is essential for our system. On the one hand, as described in Section V-A, we want to make keyframes sparse to reduce the consumption of computational resources. On the other hand, the sparser the keyframes, the more likely tracking failure happens. To balance the efficiency and the tracking robustness, a frame will be selected as a keyframe if any of the following conditions is satisfied:
因此,关键帧选择对于我们的系统至关重要。一方面,如第V-A节所述,我们希望关键帧稀疏以减少计算资源的消耗。另一方面,关键帧越稀疏,跟踪失败的可能性越大。为了平衡效率和跟踪鲁棒性,如果满足以下任一条件,帧将被选为关键帧:
- The tracked features are less than \({\alpha }_{1} \cdot {N}_{s}\).
- 跟踪的特征少于 \({\alpha }_{1} \cdot {N}_{s}\)。
- The average parallax of tracked features between the current frame and the last keyframe is larger than \({\alpha }_{2} \cdot \sqrt{WH}\).
- 当前帧与上一个关键帧之间跟踪特征的平均视差大于 \({\alpha }_{2} \cdot \sqrt{WH}\)。
- The number of tracked features is less than \({N}_{kf}\).
- 跟踪的特征数量少于 \({N}_{kf}\)。
In the above, \({\alpha }_{1},{\alpha }_{2}\) ,and \({N}_{kf}\) are all preset thresholds. \({N}_{s}\) is the number of detected features. \(W\) and \(H\) respectively represent the width and height of the input image.
在上文中,\({\alpha }_{1},{\alpha }_{2}\) 和 \({N}_{kf}\) 都是预设的阈值。\({N}_{s}\) 是检测到的特征数量。\(W\) 和 \(H\) 分别表示输入图像的宽度和高度。
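关键帧判定可以概括为下面这个简单的布尔函数；其中的阈值数值仅为示意性假设，实际取值以原文的实验设置为准：

```python
# 示意性草图：按第 V-D 节的三个条件判断当前帧是否选为关键帧
def is_keyframe(n_tracked, n_detected, parallax, W, H,
                alpha1=0.65, alpha2=0.1, N_kf=100):   # 阈值数值为假设
    c1 = n_tracked < alpha1 * n_detected              # 跟踪特征少于 α1 · Ns
    c2 = parallax > alpha2 * (W * H) ** 0.5           # 平均视差大于 α2 · sqrt(WH)
    c3 = n_tracked < N_kf                             # 跟踪特征数量少于 N_kf
    return c1 or c2 or c3

print(is_keyframe(n_tracked=80, n_detected=300, parallax=30.0, W=640, H=480))
```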
E. Local Graph Optimization 局部图优化
To improve the accuracy, we perform the local bundle adjustment when a new keyframe is inserted. \({N}_{o}\) latest neighboring keyframes are selected to construct a local graph, where map points, 3D lines, and keyframes are vertices and pose constraints are edges. We use point constraints and line constraints as well as IMU constraints if an IMU is accessible. Their related error terms are defined as follows.
为了提高准确性,我们在插入新关键帧时执行局部束调整。选择最近的 \({N}_{o}\) 个相邻关键帧来构建局部图,其中地图点、三维线和关键帧是顶点,姿态约束是边。如果可以访问IMU,我们还使用点约束、线约束以及IMU约束。它们的相关误差项定义如下。
- Point Re-projection Error: If the frame \(i\) can observe the 3D map point \({\mathbf{X}}_{p}\), then the re-projection error is defined as:
- 点重投影误差：如果帧 \(i\) 可以观察到三维地图点 \({\mathbf{X}}_{p}\)，则重投影误差定义为：
where \({\widetilde{\mathbf{x}}}_{i,p}\) is the observation of \({\mathbf{X}}_{p}\) on frame \(i\) and \(\pi \left( \cdot \right)\) represents the camera projection.
其中 \({\widetilde{\mathbf{x}}}_{i,p}\) 是 \({\mathbf{X}}_{p}\) 在帧 \(i\) 上的观测,\(\pi \left( \cdot \right)\) 表示相机投影。
- Line Re-projection Error: If the frame \(i\) can observe the 3D line \({\mathbf{L}}_{q}\), then the re-projection error is defined as:
- 线重投影误差：如果帧 \(i\) 可以观察到三维线 \({\mathbf{L}}_{q}\)，则重投影误差定义为：
where \({\widetilde{\mathbf{l}}}_{i,q}\) is the observation of \({\mathbf{L}}_{q}\) on frame \(i\), \({\widetilde{\mathbf{p}}}_{i,{q1}}\) and \({\widetilde{\mathbf{p}}}_{i,{q2}}\) are the endpoints of \({\widetilde{\mathbf{l}}}_{i,q}\), and \(d\left( {\mathbf{p},\mathbf{l}}\right)\) is the distance between point \(\mathbf{p}\) and line \(\mathbf{l}\) which is computed through (3).
其中 \({\widetilde{\mathbf{l}}}_{i,q}\) 是 \({\mathbf{L}}_{q}\) 在帧 \(i\) 上的观测，\({\widetilde{\mathbf{p}}}_{i,{q1}}\) 和 \({\widetilde{\mathbf{p}}}_{i,{q2}}\) 是 \({\widetilde{\mathbf{l}}}_{i,q}\) 的端点，\(d\left( {\mathbf{p},\mathbf{l}}\right)\) 是点 \(\mathbf{p}\) 与线 \(\mathbf{l}\) 之间的距离，通过式 (3) 计算。
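下面是点重投影误差与线重投影误差的简化实现草图（相机投影采用标准针孔模型，点到线距离沿用前述式 (3) 的常见形式；未包含信息矩阵与鲁棒核函数）：

```python
# 示意性草图：局部 BA 中使用的点、线重投影误差项
import numpy as np

def point_reprojection_error(X_w, R_cw, t_cw, K, x_obs):
    """点重投影误差：e = x_obs - π(R X + t)，K 为 3x3 内参矩阵。"""
    Xc = R_cw @ X_w + t_cw
    u = K @ (Xc / Xc[2])
    return x_obs - u[:2]

def line_reprojection_error(l, p1_obs, p2_obs):
    """线重投影误差：观测线段两端点到重投影直线 l = (A, B, C) 的距离。"""
    A, B, C = l
    d = lambda p: abs(A * p[0] + B * p[1] + C) / np.hypot(A, B)
    return np.array([d(p1_obs), d(p2_obs)])
```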
- IMU Residuals: We first follow [69] to pre-integrate IMU measurements between the frame \(i\) and the frame \(j\):
- IMU残差：我们首先按照 [69] 预积分帧 \(i\) 和帧 \(j\) 之间的IMU测量值：
where \({\widetilde{\mathbf{\omega }}}_{\mathbf{k}}\) and \({\widetilde{\mathbf{a}}}_{k}\) are respectively the angular velocity and the acceleration. \({\mathbf{b}}_{k}^{g}\) and \({\mathbf{b}}_{k}^{a}\) are biases of the sensor and they are modeled as constants between two keyframes through \({\mathbf{b}}_{k}^{g} = {\mathbf{b}}_{k + 1}^{g}\) and \({\mathbf{b}}_{k}^{a} = {\mathbf{b}}_{k + 1}^{a}\). \({\mathbf{\eta }}_{k}^{gd}\) and \({\mathbf{\eta }}_{k}^{ad}\) are Gaussian noises. Then IMU residuals are defined as:
其中 \({\widetilde{\mathbf{\omega }}}_{\mathbf{k}}\) 和 \({\widetilde{\mathbf{a}}}_{k}\) 分别是角速度和加速度。\({\mathbf{b}}_{k}^{g}\) 和 \({\mathbf{b}}_{k}^{a}\) 是传感器的偏差，它们在两个关键帧之间被建模为常数，即 \({\mathbf{b}}_{k}^{g} = {\mathbf{b}}_{k + 1}^{g}\) 和 \({\mathbf{b}}_{k}^{a} = {\mathbf{b}}_{k + 1}^{a}\)。\({\mathbf{\eta }}_{k}^{gd}\) 和 \({\mathbf{\eta }}_{k}^{ad}\) 是高斯噪声。然后，IMU 残差定义为：
where \(\mathbf{g}\) is the gravity vector in world coordinates. In our system, we combine the initialization process in [3] and [60] to estimate \(\mathbf{g}\) and initial values of biases.
其中 \(\mathbf{g}\) 是世界坐标系中的重力矢量。在我们的系统中,我们结合了 [3] 和 [60] 中的初始化过程来估计 \(\mathbf{g}\) 和偏差的初始值。
The factor graph is optimized by the g2o toolbox [74]. The cost function is defined as:
因子图通过 g2o 工具箱 [74] 进行优化。代价函数定义为:
We use the Levenberg-Marquardt optimizer to minimize the cost function. The point and line outliers are also rejected in the optimization if their corresponding residuals are too large.
我们使用 Levenberg-Marquardt 优化器来最小化代价函数。如果对应的残差过大，点和线的外点也会在优化过程中被剔除。
F. 初始地图
As described in Section III, our map is optimized offline. Therefore, keyframes, map points, and 3D lines will be saved to the disk for subsequent optimization when the visual odometry finishes. For each keyframe, we save its keypoints, keypoint descriptors, line features, and junctions. The correspondences between \(2\mathrm{D}\) features and \(3\mathrm{D}\) features are also recorded. To make the map faster to save, load, and transfer across different devices, the above information is stored in binary form, which also makes the initial map much smaller than the raw data. For example, on the OIVIO dataset [75], our initial map size is only about \(2\%\) of the raw data size.
如第三节所述，我们的地图是离线优化的。因此，当视觉里程计运行结束后，关键帧、地图点和三维线将被保存到磁盘上，以便后续优化；对于每个关键帧，我们保存其关键点、关键点描述符、线特征和交点。\(2\mathrm{D}\) 特征和 \(3\mathrm{D}\) 特征之间的对应关系也被记录。为了使地图更快地保存、加载和在不同设备之间传输，上述信息以二进制形式存储，这也使得初始地图比原始数据小得多。例如，在 OIVIO 数据集 [75] 上，我们的初始地图大小仅为原始数据大小的约 \(2\%\)。
VI. 地图优化与重用
A. 离线地图优化
This part aims to process an initial map generated by our VO module and outputs the optimized map that can be used for drift-free relocalization. Our offline map optimization module consists of the following several map-processing plugins.
这部分旨在处理由我们的 VO 模块生成的初始地图,并输出可用于无漂移重定位的优化地图。我们的离线地图优化模块包括以下几个地图处理插件。
- Loop Closure Detection: Similar to most current vSLAM systems, we use a coarse-to-fine pipeline to detect loop closures. Our loop closure detection relies on DBoW2 [76] to retrieve candidates and LightGlue [26] to match features. We train a vocabulary for the keypoints detected by our PLNet on a database that contains \({35}\mathrm{k}\) images. These images are selected from several large datasets [77]-[79] that include both indoor and outdoor scenes. The vocabulary has 4 layers, with 10 nodes at each layer, so it contains 10,000 words.
- 闭环检测：与大多数当前的 vSLAM 系统类似，我们采用由粗到细的流程来检测闭环。我们的闭环检测依赖于 DBoW2 [76] 来检索候选对象，并使用 LightGlue [26] 来匹配特征。我们在包含 \({35}\mathrm{k}\) 张图像的数据库上为 PLNet 检测到的关键点训练了一个词汇表。这些图像选自包含室内外场景的几个大型数据集 [77]-[79]。该词汇表有 4 层，每层有 10 个节点，因此包含 10,000 个词汇。
Coarse Candidate Selection: This step aims to find three candidates most similar to a keyframe \({\mathcal{K}}_{i}\) from a set \({\mathcal{S}}_{1} = \left\{ {{\mathcal{K}}_{j} \mid j < i}\right\}\). Note that we do not add keyframes with an index greater than that of \({\mathcal{K}}_{i}\) to the set because this may miss some loop pairs. We build a co-visibility graph for all keyframes where two are connected if they observe at least one common feature. All keyframes connected with \({\mathcal{K}}_{i}\) will be first removed from \({\mathcal{S}}_{1}\). Then we compute a similarity score between \({\mathcal{K}}_{i}\) and each keyframe in \({\mathcal{S}}_{1}\) using DBoW2. Only keyframes with a score greater than \({0.3} \cdot {S}_{\max }\) will be kept in \({\mathcal{S}}_{1}\), where \({S}_{\max }\) is the maximum computed score. After that, we group the remaining keyframes. If two keyframes can observe more than 10 features in common, they will be in the same group. For each group, we sum up the scores of the keyframes in this group and use it as the group score. Only the top 3 groups with the highest scores will be retained. Then we select one keyframe with the highest score within the group as the candidate from each group. These three candidates will be processed in the subsequent steps.
粗略候选选择:此步骤旨在从集合 \({\mathcal{S}}_{1} =\) \(\left\{ {{\mathcal{K}}_{j} \mid j < i}\right\}\) 中找到与关键帧 \({\mathcal{K}}_{i}\) 最相似的三个候选对象。请注意,我们不会将索引大于 \({\mathcal{K}}_{i}\) 的关键帧添加到集合中,因为这可能会遗漏一些闭环对。我们为所有关键帧构建了一个共视图,如果两个关键帧至少观察到一个特征,则它们相连。首先从 \({\mathcal{S}}_{1}\) 中移除与 \({\mathcal{K}}_{i}\) 相连的所有关键帧。然后使用 DBoW2 计算 \({\mathcal{K}}_{i}\) 与 \({\mathcal{S}}_{1}\) 中每个关键帧之间的相似度得分。只有得分大于 \({0.3} \cdot {S}_{\max }\) 的关键帧才会保留在 \({\mathcal{S}}_{1}\) 中,其中 \({S}_{\max }\) 是计算出的最大得分。之后,我们对剩余的关键帧进行分组。如果两个关键帧可以共同观察到超过 10 个特征,它们将属于同一组。对于每个组,我们将该组中关键帧的得分相加,并将其作为组得分。仅保留得分最高的三个组。然后,我们从每个组中选择得分最高的一个关键帧作为该组的候选对象。这三个候选对象将在后续步骤中进行处理。
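The selection logic above can be summarized in code. The sketch below is illustrative only; the Keyframe type and the dbow_score / shares_10_features helpers are hypothetical placeholders, not AirSLAM's actual interfaces.

```cpp
// Illustrative sketch of the coarse candidate selection described above.
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct Keyframe { int id; };

std::vector<const Keyframe*> SelectCoarseCandidates(
    const Keyframe& query,
    const std::vector<Keyframe>& earlier,          // keyframes with index < query
    const std::vector<bool>& connected_to_query,   // co-visibility with the query
    double (*dbow_score)(const Keyframe&, const Keyframe&),
    bool (*shares_10_features)(const Keyframe&, const Keyframe&)) {
  // 1. Drop keyframes connected to the query in the co-visibility graph,
  //    then score the remaining ones with DBoW2.
  std::vector<std::pair<double, const Keyframe*>> scored;
  double s_max = 0.0;
  for (std::size_t i = 0; i < earlier.size(); ++i) {
    if (connected_to_query[i]) continue;
    const double s = dbow_score(query, earlier[i]);
    s_max = std::max(s_max, s);
    scored.push_back({s, &earlier[i]});
  }
  // 2. Keep only keyframes whose score exceeds 0.3 * s_max.
  std::vector<std::pair<double, const Keyframe*>> kept;
  for (const auto& p : scored)
    if (p.first > 0.3 * s_max) kept.push_back(p);
  // 3. Group keyframes that observe >10 common features; the group score is
  //    the sum of member scores (greedy grouping for brevity).
  struct Group { double score = 0; std::pair<double, const Keyframe*> best{0, nullptr}; };
  std::vector<Group> groups;
  std::vector<std::vector<const Keyframe*>> members;
  for (const auto& p : kept) {
    int g = -1;
    for (std::size_t i = 0; i < members.size() && g < 0; ++i)
      for (const Keyframe* m : members[i])
        if (shares_10_features(*m, *p.second)) { g = static_cast<int>(i); break; }
    if (g < 0) { groups.push_back({}); members.push_back({}); g = static_cast<int>(groups.size()) - 1; }
    members[g].push_back(p.second);
    groups[g].score += p.first;
    if (p.first > groups[g].best.first) groups[g].best = p;
  }
  // 4. Take the best keyframe from each of the top-3 groups.
  std::sort(groups.begin(), groups.end(),
            [](const Group& a, const Group& b) { return a.score > b.score; });
  std::vector<const Keyframe*> candidates;
  for (std::size_t i = 0; i < groups.size() && i < 3; ++i)
    candidates.push_back(groups[i].best.second);
  return candidates;
}
```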
Fine Feature Matching: For each selected candidate, we match its features with \({\mathcal{K}}_{i}\). Then relative pose estimation with outlier rejection is performed. The candidate forms a valid loop pair with \({\mathcal{K}}_{i}\) if the number of inliers exceeds 50.
精细特征匹配:对于每个选定的候选对象,我们将其特征与 \({\mathcal{K}}_{i}\) 进行匹配。然后,将执行带有异常值剔除的相对姿态估计。如果内点超过 50,候选对象将与 \({\mathcal{K}}_{i}\) 形成有效的回环对。
Map Merging: A 3D feature observed by both frames of a loop pair is usually mistakenly used as two features. Therefore, in this part, we aim to merge the duplicated point and line features observed by loop pairs. For keypoint features, we use the above feature-matching results between loop pairs. If two matched keypoints are associated with two different map points, they will be regarded as duplicated features and only one map point will be retained. The correspondence between 2D keypoints and 3D map points, as well as the connections in the co-visibility graph, will also be updated.
地图合并:一个回环对的两帧共同观察到的 3D 特征通常会被错误地用作两个特征。因此,在这一部分,我们的目标是合并由回环对观察到的重复的点特征和线特征。对于关键点特征,我们使用上述回环对之间的特征匹配结果。如果两个匹配的关键点与两个不同的地图点相关联,它们将被视为重复特征,并且只保留一个地图点。二维关键点与三维地图点之间的对应关系,以及共视图中的连接,也将被更新。
For line features, we first associate 3D lines and map points through the 2D-3D feature correspondence and 2D point-line association built in Section V-B. Then we detect 3D line pairs that associate with the same map points. If two 3D lines share more than 3 associated map points, they will be regarded as duplicated and only one 3D line will be retained.
对于线特征,我们首先通过第 V-B 节中建立的二维-三维特征对应关系和二维点-线关联来关联三维线和地图点。然后,我们检测与相同地图点相关联的三维线对。如果两条三维线共享超过 3 个相关联的地图点,它们将被视为重复的,并且只保留一条三维线。
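A minimal sketch of the duplicated-line test described above, assuming a hypothetical map structure that associates each 3D line with the ids of its map points:

```cpp
// Illustrative sketch: two 3D lines sharing more than 3 map points are duplicates.
#include <iterator>
#include <map>
#include <set>
#include <utility>
#include <vector>

// line_to_points[l] = set of map-point ids associated with 3D line l.
std::vector<std::pair<int, int>> FindDuplicatedLines(
    const std::map<int, std::set<int>>& line_to_points) {
  std::vector<std::pair<int, int>> duplicates;
  for (auto it1 = line_to_points.begin(); it1 != line_to_points.end(); ++it1) {
    for (auto it2 = std::next(it1); it2 != line_to_points.end(); ++it2) {
      int shared = 0;
      for (int pt : it1->second)
        if (it2->second.count(pt)) ++shared;
      if (shared > 3)  // threshold used in the text
        duplicates.push_back({it1->first, it2->first});  // keep one, drop the other
    }
  }
  return duplicates;
}
```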
Global Bundle Adjustment: We perform the global bundle adjustment (GBA) after merging duplicated features. The residuals and cost function are similar to those in Section V-E; the difference is that all keyframes and features are optimized in this module. In the initial stage of optimization, the re-projection errors of merged features are relatively large due to the VO drift error, so we first iterate 50 times without outlier rejection to move the variables to a good rough position, and then iterate another 40 times with outlier rejection.
全局束调整:在合并重复特征后,我们执行全局束调整(GBA)。残差和成本函数与第 V-E 节类似,不同之处在于所有关键帧和特征将在本模块中进行优化。在优化的初始阶段,由于视觉里程计漂移误差,合并特征的重投影误差相对较大,因此我们首先迭代 50 次不进行异常值剔除以将变量优化到良好的粗略位置,然后进行另外 40 次迭代并进行异常值剔除。
We find that when the map is large, the initial 50 iterations cannot optimize the variables to a satisfactory position. To address this, we first perform pose graph optimization (PGO) before the global bundle adjustment if a map contains more than \({80}\mathrm{k}\) map points. Only the keyframe poses are adjusted in the PGO, and the cost function is defined as follows:
我们发现,当地图较大时,最初的50次迭代无法将变量优化到一个满意的位置。为了解决这个问题,如果地图包含超过 \({80}\mathrm{k}\) 个地图点,我们会在全局束调整之前先进行位姿图优化(PGO)。在PGO中,只有关键帧的位姿会被调整,其代价函数定义如下:
where \({\mathbf{T}}_{i} \in \mathrm{{SE}}\left( 3\right)\) and \({\mathbf{T}}_{j} \in \mathrm{{SE}}\left( 3\right)\) are poses of \({\mathcal{K}}_{i}\) and \({\mathcal{K}}_{j}\) , respectively. \({\mathcal{K}}_{i}\) and \({\mathcal{K}}_{j}\) should either be adjacent or form a loop pair. After the pose graph optimization, the positions of map points and 3D lines will also be adjusted along with the keyframes in which they are first observed.
其中 \({\mathbf{T}}_{i} \in \mathrm{{SE}}\left( 3\right)\) 和 \({\mathbf{T}}_{j} \in \mathrm{{SE}}\left( 3\right)\) 分别是 \({\mathcal{K}}_{i}\) 和 \({\mathcal{K}}_{j}\) 的位姿。\({\mathcal{K}}_{i}\) 和 \({\mathcal{K}}_{j}\) 应该是相邻的或形成一个闭环对。在位姿图优化之后,地图点和3D线的位置也会随着它们首次被观察到的关键帧一起调整。
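The pose-graph cost itself is missing from this copy. A standard relative-pose form that is consistent with the surrounding description, given here only as an assumed reconstruction rather than the paper's exact equation, is
\[
E_{\mathrm{pgo}} = \sum_{(i,j) \in \mathcal{E}} \left\| \log \left( \widetilde{\mathbf{T}}_{ij}^{-1} \, \mathbf{T}_{i}^{-1} \mathbf{T}_{j} \right)^{\vee} \right\|_{\Sigma_{ij}}^{2},
\]
where \(\mathcal{E}\) contains the adjacent and loop-pair edges, \(\widetilde{\mathbf{T}}_{ij}\) is the measured relative pose of an edge, and \(\Sigma_{ij}\) is its covariance.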
The systems with online loop detection usually perform the GBA after detecting a new loop, so they undergo multiple repeated GBAs when a scene contains many loops. In contrast, our offline map optimization module only does the GBA after all loop closures are detected, allowing us to reduce the optimization iterations significantly compared with them.
具有在线闭环检测的系统通常在检测到新的闭环后执行GBA,因此当场景包含许多闭环时,它们会经历多次重复的GBA。相比之下,我们的离线地图优化模块只在所有闭环被检测到后执行GBA,这使得我们能够显著减少优化迭代次数。
Scene-Dependent Vocabulary: We train a junction vocabulary aiming to be used for relocalization. The vocabulary is built on the junctions of keyframes in the map so it is scene-dependent. Compared with the keypoint vocabulary trained in Section VI-A1, the database used to train the junction vocabulary is generally much smaller, so we set the number of layers to 3 , with 10 nodes in each layer. The junction vocabulary is tiny, i.e., about 1 megabyte, as it only contains 1000 words. Its detailed usage will be introduced in Section VI-B.
场景依赖词汇:我们训练了一个用于重定位的交汇点词汇。该词汇是基于地图中关键帧的交汇点构建的,因此是场景依赖的。与第VI-A1节中训练的关键点词汇相比,用于训练交汇点词汇的数据库通常要小得多,因此我们将层数设置为3层,每层有10个节点。交汇点词汇很小,即大约1兆字节,因为它只包含1000个单词。其详细用法将在第VI-B节中介绍。
Optimized Map: We save the optimized map for the subsequent map reuse. Compared with the initial map in Section V-F, more information is saved, such as the bag of words for each keyframe, the global co-visibility graph, and the scene-dependent junction vocabulary. In the meantime, the number of \(3\mathrm{D}\) features has decreased due to the fusion of duplicate map points and 3D lines. Therefore, the optimized map occupies a similar amount of memory to the initial map.
优化地图:我们保存优化后的地图以便后续地图复用。与第五节-F中的初始地图相比,保存了更多信息,例如每个关键帧的词袋、全局共视图和场景相关联接词汇。同时,由于融合了重复的地图点和三维线段,\(3\mathrm{D}\)特征的数量减少了。因此,优化后的地图占用的内存与初始地图相似。
B. 地图复用
In this part, we present our illumination-robust relocalization using an existing optimized map. In most vSLAM systems, recognizing revisited places typically needs two steps: (1) retrieving \({N}_{kc}\) keyframe candidates and (2) performing feature matching and estimating relative pose. The second step is usually time-consuming,so selecting a proper \({N}_{kc}\) is very important. A larger \({N}_{kc}\) will reduce the system’s efficiency while a smaller \({N}_{kc}\) may prevent the correct candidate from being recalled. For example, in the loop closing module of ORB-SLAM3 [60], only the three most similar keyframes retrieved by DBoW2 [76] are used for better efficiency. It works well as two frames in a loop pair usually have a short time interval and thus the lighting conditions are relatively similar. But for challenging tasks, such as the day/night relocalization problem, retrieving so few candidates usually results in a low recall rate. However, retrieving more candidates needs to perform feature matching and pose estimation more times for each query frame, which makes it difficult to deploy for real-time applications.
在这一部分,我们介绍了使用现有优化地图进行光照鲁棒的重定位。在大多数vSLAM系统中,识别重访地点通常需要两个步骤:(1)检索\({N}_{kc}\)关键帧候选者,(2)进行特征匹配和估计相对姿态。第二步通常耗时较长,因此选择合适的\({N}_{kc}\)非常重要。较大的\({N}_{kc}\)会降低系统的效率,而较小的\({N}_{kc}\)可能会阻止正确候选者的召回。例如,在ORB-SLAM3 [60]的闭环检测模块中,仅使用DBoW2 [76]检索到的三个最相似的关键帧以提高效率。这在两个闭环帧通常时间间隔较短且光照条件相对相似的情况下效果良好。但对于具有挑战性的任务,如昼夜重定位问题,检索如此少的候选者通常会导致召回率较低。然而,检索更多候选者需要对每个查询帧进行更多次的特征匹配和姿态估计,这使得实时应用部署变得困难。
To address this problem, we propose an efficient multi-stage relocalization method to make the optimized map usable in different lighting conditions. Our insight is that if most of the false candidates can be quickly filtered out, then the efficiency can be improved while maintaining or even improving the relocalization recall rate. Therefore, we add another step to the two-step pipeline mentioned above. We next introduce the proposed multi-stage pipeline in detail.
为了解决这个问题,我们提出了一种高效的多阶段重定位方法,以使优化后的地图在不同光照条件下可用。我们的见解是,如果大多数错误候选对象能够被快速过滤掉,那么效率可以在保持或甚至提高重定位召回率的同时得到提升。因此,我们在上述两步流程中增加了一个步骤。接下来,我们将详细介绍所提出的多阶段流程。
The First Step: This step aims to retrieve the keyframes in the map that are similar to the query frame. For each input monocular image, we detect keypoints, junctions, and line features using our PLNet. Then a pipeline similar to the "coarse candidate selection" in Section VI-A1 will be executed, but with two differences. The first difference is that we do not filter out candidates using the co-visibility graph as the query frame is not in the graph. The second is that all candidates, not just three, will be retained for the next step.
第一步:此步骤旨在检索地图中与查询帧相似的关键帧。对于每个输入的单目图像,我们使用PLNet检测关键点、交点和线段特征。然后执行类似于第VI-A1节中的“粗略候选选择”的流程,但有两个不同之处。第一个区别是我们不使用共视图来过滤候选对象,因为查询帧不在图中。第二个区别是所有候选对象,不仅仅是三个,都将保留到下一步。
The Second Step: This step filters out most of the candidates selected in the first step using junctions and line features. For query frame \({\mathcal{K}}_{q}\) and each candidate \({\mathcal{K}}_{b}\), we first match their junctions by finding the same words through the junction vocabulary trained in Section VI-A4. We use \(\left\{ {\left( {{q}_{i},{b}_{i}}\right) \mid {q}_{i} \in {\mathcal{K}}_{q},{b}_{i} \in {\mathcal{K}}_{b}}\right\}\) to denote the matching pairs. Then we construct two structure graphs, i.e., \({G}_{q}^{J}\) and \({G}_{b}^{J}\), for \({\mathcal{K}}_{q}\) and \({\mathcal{K}}_{b}\), respectively. The vertices are the matched junctions, i.e., \({V}_{q}^{J} = \left\{ {{q}_{i} \mid {q}_{i} \in {\mathcal{K}}_{q}}\right\}\) and \({V}_{b}^{J} = \left\{ {{b}_{i} \mid {b}_{i} \in {\mathcal{K}}_{b}}\right\}\). The related adjacency matrices that describe the connections between vertices are defined as:
第二步:这一步使用交点和线段特征过滤掉第一步中选出的大多数候选对象。对于查询帧 \({\mathcal{K}}_{q}\) 和每个候选 \({\mathcal{K}}_{b}\),我们首先通过在第VI-A4节中训练的交点词汇表找到相同的词来匹配它们的交点。我们用 \(\left\{ {\left( {{q}_{i},{b}_{i}}\right) \mid {q}_{i} \in {\mathcal{K}}_{q},{b}_{i} \in {\mathcal{K}}_{b}}\right\}\) 表示匹配的对。然后我们分别为 \({\mathcal{K}}_{q}\) 和 \({\mathcal{K}}_{b}\) 构建两个结构图,即 \({G}_{q}^{J}\) 和 \({G}_{b}^{J}\)。顶点是匹配的交点,即 \({V}_{q}^{J} = \left\{ {{q}_{i} \mid {q}_{i} \in {\mathcal{K}}_{q}}\right\}\) 和 \({V}_{b}^{J} = \left\{ {{b}_{i} \mid {b}_{i} \in {\mathcal{K}}_{b}}\right\}\)。描述顶点之间连接的相关邻接矩阵定义为:
where \(n\) is the number of junction-matching pairs. \({q}_{ij}\) is set to 1 if the junction \({q}_{i}\) and \({q}_{j}\) are two endpoints of the same line,otherwise,it is set to 0 . The same goes for \({b}_{ij}\) . Then the graph similarity of \({G}_{q}^{J}\) and \({G}_{b}^{J}\) can be computed through:
其中 \(n\) 是交点匹配对的数目。如果交点 \({q}_{i}\) 和 \({q}_{j}\) 是同一条线的两个端点,则 \({q}_{ij}\) 设为 1,否则设为 0。\({b}_{ij}\) 也是如此。然后,\({G}_{q}^{J}\) 和 \({G}_{b}^{J}\) 的图相似度可以通过以下公式计算:
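The two equations referenced above are missing from this copy. The adjacency matrices follow directly from the description; the graph similarity is written here in one plausible edge-agreement form and should be read as an assumption rather than the paper's exact formula:
\[
\mathbf{Q} = \left\lbrack {q}_{ij} \right\rbrack \in \{ 0,1{\} }^{n \times n}, \quad \mathbf{B} = \left\lbrack {b}_{ij} \right\rbrack \in \{ 0,1{\} }^{n \times n}, \qquad
{S}_{qb}^{G} \approx \frac{\sum_{i,j} {q}_{ij} {b}_{ij}}{\max \left( \sum_{i,j} {q}_{ij},\ \sum_{i,j} {b}_{ij},\ 1 \right) },
\]
i.e., the fraction of line-induced junction connections that agree between the two graphs.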
We also compute a junction similarity score \({S}_{qb}^{J}\) using the junction vocabulary and the DBoW2 algorithm. Finally, the similarity score of \({\mathcal{K}}_{q}\) and \({\mathcal{K}}_{b}\) is given by combining the keypoint similarity, junction similarity, and structure graph similarity:
我们还使用交点词汇表和DBoW2算法计算了一个交点相似度分数 \({S}_{qb}^{J}\)。最后,\({\mathcal{K}}_{q}\) 和 \({\mathcal{K}}_{b}\) 的相似度分数通过结合关键点相似度、交点相似度和结构图相似度来给出:
where \({S}_{qb}^{K}\) is the keypoint similarity of \({\mathcal{K}}_{q}\) and \({\mathcal{K}}_{b}\) computed in the first step. We compute the similarity score with the query frame for each candidate, and only the top 3 candidates with the highest similarity scores will be retained for the next step.
其中 \({S}_{qb}^{K}\) 是第一步中计算的 \({\mathcal{K}}_{q}\) 和 \({\mathcal{K}}_{b}\) 的关键点相似度。我们为每个候选帧计算与查询帧的相似度分数,并仅保留相似度分数最高的前3个候选帧进入下一步。
Analysis: We next analyze the second step. In the normal two-step pipeline that uses the DBoW method, only appearance information is used to retrieve candidates. The structural information, i.e., the necessity of the consistent spatial distribution of features between the query frame and candidate, is ignored in the first step and only used in the second step. However, in the illumination-challenging scenes, the structural information is essential as it is invariant to lighting conditions. In our second step, a portion of the structural information is utilized to select candidates. First, our PLNet uses the wireframe-parsing method to detect structural lines, which are more stable in illumination-challenging environments. Second, the similarity computed in (20) utilizes both the appearance information and the structural information. Therefore, our system can achieve good performance in illumination-challenging environments although using the efficient DBoW method.
分析:接下来我们分析第二步。在正常使用DBoW方法的两步流程中,仅使用外观信息来检索候选帧。结构信息,即查询帧和候选帧之间特征的空间分布一致性的必要性,在第一步中被忽略,仅在第二步中使用。然而,在光照挑战场景中,结构信息至关重要,因为它是光照条件不变的。在我们的第二步中,一部分结构信息被用来选择候选帧。首先,我们的PLNet使用线框解析方法来检测结构线,这些结构线在光照挑战环境中更为稳定。其次,公式(20)中计算的相似度同时利用了外观信息和结构信息。因此,尽管使用了高效的DBoW方法,我们的系统仍能在光照挑战环境中取得良好的性能。
The second step is also highly efficient. On the one hand, junctions are usually much fewer than keypoints. In normal scenes, our PLNet can detect more than 400 good keypoints but only about 50 junctions. On the other hand, the junction vocabulary is tiny and only contains 1,000 words. Therefore, matching junctions using DBoW2, constructing junction graphs, and computing similarity scores are all executed very efficiently. The experiment shows that the second step can be done within \({0.7}\mathrm{\;{ms}}\). More results will be presented in Section VII.
第二步同样非常高效。一方面,交叉点通常远少于关键点。在正常场景中,我们的 PLNet 可以检测到超过 400 个优质关键点,但只有大约 50 个交叉点。另一方面,交叉点词汇量很小,仅包含 1,000 个单词。因此,使用 DBoW2 匹配交叉点、构建交叉点图以及计算相似度分数都非常高效。实验表明,第二步可以在 \({0.7}\mathrm{\;{ms}}\) 内完成。更多结果将在第七节中展示。
The Third Step: The third step aims to estimate the pose of the query frame. We first use LightGlue to match features between the query frame and the retained candidates. The candidate with the most matching inliers will be selected as the best candidate. Then based on the matching results of the query frame and the best candidate, we can associate the query keypoints with map points. Finally, a PnP problem is solved with RANSAC to estimate the pose. The pose will be considered valid if the inliers exceed 20 .
第三步:第三步旨在估计查询帧的姿态。我们首先使用 LightGlue 在查询帧和保留的候选帧之间匹配特征。具有最多匹配内点的候选帧将被选为最佳候选帧。然后,基于查询帧和最佳候选帧的匹配结果,我们可以将查询关键点与地图点关联起来。最后,通过 RANSAC 解决 PnP 问题来估计姿态。如果内点数量超过 20,则认为姿态有效。
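The final step maps directly onto a standard PnP-with-RANSAC call. The sketch below is illustrative and uses OpenCV's solvePnPRansac as a stand-in for AirSLAM's own solver; the 4-pixel reprojection threshold is an assumption, while the 20-inlier acceptance test follows the text.

```cpp
// Illustrative sketch of the third relocalization step (not the AirSLAM code).
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

bool RelocalizePnP(const std::vector<cv::Point3f>& map_points,      // 3D points of the best candidate
                   const std::vector<cv::Point2f>& query_keypoints, // matched 2D keypoints in the query
                   const cv::Mat& K, cv::Mat& rvec, cv::Mat& tvec) {
  std::vector<int> inliers;
  const bool ok = cv::solvePnPRansac(map_points, query_keypoints, K, cv::Mat(),
                                     rvec, tvec, /*useExtrinsicGuess=*/false,
                                     /*iterationsCount=*/100,
                                     /*reprojectionError=*/4.0f,
                                     /*confidence=*/0.99, inliers);
  // The pose is accepted only if more than 20 correspondences are inliers.
  return ok && inliers.size() > 20u;
}
```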
VII. 实验
In this section, we present the experiment results. The remainder of this section is organized as follows. In Section VII-A, we evaluate the line detection performance of the proposed PLNet. In Section VII-B, we evaluate the mapping accuracy of our system by comparing it with other SOTA VO or SLAM systems. In Section VII-C, we test our system in three illumination-challenging scenarios: onboard illumination, dynamic illumination, and low illumination. The comparison of these three scenarios will show the excellent robustness of our system. In Section VII-D, we assess the performance of the proposed map reuse module in addressing the day/night localization challenges, i.e., mapping during the day and relocalization at night. In Section VII-E, we present the ablation study. In Section VII-F, we evaluate the efficiency.
在本节中,我们展示了实验结果。本节的其余部分组织如下。在第七章A节中,我们评估了所提出的PLNet的线条检测性能。在第七章B节中,我们通过与其他SOTA视觉里程计(VO)或同时定位与地图构建(SLAM)系统进行比较,评估了我们系统的地图精度。在第七章C节中,我们在三种光照挑战场景下测试我们的系统:车载光照、动态光照和低光照。这三种场景的比较将展示我们系统的出色鲁棒性。在第七章D节中,我们评估了所提出的地图重用模块在解决昼夜定位挑战方面的性能,即白天进行地图构建,夜间进行重定位。在第七章E节中,我们展示了消融研究。在第七章F节中,我们评估了效率。
We use two platforms in the experiments. Most evaluations are conducted on a personal computer with an Intel i9-13900 CPU and a NVIDIA GeForce RTX 4070 GPU. In the efficiency experiment in Section VII-F, we also deploy AirSLAM on an NVIDIA Jetson Orin to prove that our system can achieve good accuracy and efficiency on the embedded platform.
我们在实验中使用了两个平台。大多数评估是在一台配备Intel i9-13900 CPU和NVIDIA GeForce RTX 4070 GPU的个人电脑上进行的。在第七章F节的效率实验中,我们还将AirSLAM部署在NVIDIA Jetson Orin上,以证明我们的系统在嵌入式平台上能够实现良好的精度和效率。
A. Line Detection 线条检测
In this section, we evaluate the performance of our PLNet. As described in Section IV-B, we follow SuperPoint [23] to design and train our backbone and keypoint detection module, and we can even use the pre-trained model of SuperPoint, therefore, we do not evaluate the keypoint detection anymore. Instead, we assess the performance of the line detection module by comparing it with SOTA systems, as it is trained with a fixed backbone, which is different from other line detectors.
在本节中,我们评估了我们的PLNet的性能。如第四章B节所述,我们遵循SuperPoint [23]来设计和训练我们的骨干网络和关键点检测模块,并且我们甚至可以使用SuperPoint的预训练模型,因此,我们不再评估关键点检测。相反,我们通过与SOTA系统进行比较来评估线条检测模块的性能,因为它是在固定的骨干网络上训练的,这与其他线条检测器不同。
Datasets and Baselines: This experiment is conducted on the Wireframe dataset [66] and the YorkUrban dataset [80]. The Wireframe dataset contains 5,000 training images and 462 test images that are all collected in man-made environments. We use them to train and test our PLNet. To validate the generalization ability, we also compare various methods on the YorkUrban dataset, which contains 102 test images. All the training and test images are resized to \({512} \times {512}\). We compare our method with AFM [81], AFM++ [82], L-CNN [47], LETR [83], F-Clip [84], ELSD [85], and HAWPv2 [49].
数据集和基准:本实验在Wireframe数据集[66]和YorkUrban数据集[80]上进行。Wireframe数据集包含5,000张训练图像和462张测试图像,均采集自人工环境。我们使用这些图像来训练和测试我们的PLNet。为了验证泛化能力,我们还在YorkUrban数据集上比较了多种方法,该数据集包含102张测试图像。所有训练和测试图像均被调整为\({512} \times {512}\)。我们将我们的方法与AFM[81]、AFM++[82]、L-CNN[47]、LETR[83]、F-Clip[84]、ELSD[85]和HAWPv2[49]进行了比较。
Evaluation Metrics: We evaluate both the accuracy and efficiency of the line detection. For accuracy, the structural average precision (sAP) [47] is the most challenging metric of the wireframe parsing task. It is inspired by the mean average precision (mAP) commonly used in object detection. A detected line \(\widetilde{l} = \left( {{\widetilde{\mathbf{p}}}_{1},{\widetilde{\mathbf{p}}}_{2}}\right)\) is a True Positive (TP) if and only if it satisfies the following:
评估指标:我们评估了线条检测的准确性和效率。对于准确性,结构平均精度(sAP)[47]是线框解析任务中最具挑战性的指标。它受到物体检测中常用的平均精度均值(mAP)的启发。一个检测到的线条\(\widetilde{l} = \left( {{\widetilde{\mathbf{p}}}_{1},{\widetilde{\mathbf{p}}}_{2}}\right)\)只有在满足以下条件时才被视为真阳性(TP):
where \(\mathcal{L}\) is the set of ground truth,and \(\vartheta\) is a predefined threshold. We follow the previous methods to set \(\vartheta\) to 5,10, and 15 , then the corresponding sAP scores are represented by \({\mathrm{{sAP}}}^{5},{\mathrm{{sAP}}}^{10}\) ,and \({\mathrm{{sAP}}}^{15}\) ,respectively. For efficiency,we use the frames per second (FPS) to evaluate various systems.
其中\(\mathcal{L}\)是真实值集合,\(\vartheta\)是一个预定义的阈值。我们遵循先前的方法,将\(\vartheta\)设置为5、10和15,然后相应的sAP分数分别表示为\({\mathrm{{sAP}}}^{5},{\mathrm{{sAP}}}^{10}\)和\({\mathrm{{sAP}}}^{15}\)。对于效率,我们使用每秒帧数(FPS)来评估各种系统。
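The TP condition referenced above did not survive extraction. The structural-AP criterion commonly used in wireframe parsing, assumed to be the intended one here, accepts a detected line when its squared endpoint distances to some ground-truth line fall below the threshold:
\[
\min_{\left( {\mathbf{p}}_{1},{\mathbf{p}}_{2}\right) \in \mathcal{L}} \left( \left\| {\widetilde{\mathbf{p}}}_{1} - {\mathbf{p}}_{1} \right\|^{2} + \left\| {\widetilde{\mathbf{p}}}_{2} - {\mathbf{p}}_{2} \right\|^{2} \right) \leq \vartheta .
\]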
Results and Analysis: We present the results in Table I. The top-performing results are highlighted and underlined in order. It can be seen that our PLNet achieves the second-best performance on the Wireframe dataset and the best performance on the YorkUrban dataset. On the Wireframe dataset, HAWPv2, the best method, only outperforms our PLNet by 0.5, 0.5, and 0.4 points in \({\mathrm{{sAP}}}^{5}\), \({\mathrm{{sAP}}}^{10}\), and \({\mathrm{{sAP}}}^{15}\), respectively. On the YorkUrban dataset, our method surpasses the second-best method by 0.4, 0.8, and 0.9 points on these three metrics, respectively. Overall, we can conclude that our PLNet achieves comparable accuracy with SOTA methods.
结果与分析:我们在表I中展示了结果。表现最佳的结果按顺序加粗并加下划线标注。可以看出,我们的PLNet在Wireframe数据集上取得了第二好的性能,在YorkUrban数据集上取得了最佳性能。在Wireframe数据集上,最佳方法HAWPv2在 \({\mathrm{{sAP}}}^{5}\)、\({\mathrm{{sAP}}}^{10}\) 和 \({\mathrm{{sAP}}}^{15}\) 上仅分别比我们的PLNet高出0.5、0.5和0.4分。在YorkUrban数据集上,我们的方法在这三个指标上分别比第二好的方法高出0.4、0.8和0.9分。总的来说,我们可以得出结论,我们的PLNet与SOTA方法相比具有相当的准确性。
Generalizability Analysis: We can also conclude that the generalizability of our PLNet is better than other methods. This conclusion is based on two comparative results between our method and HAWPv2, which is the current best wireframe parsing method. First, on the Wireframe dataset, which also serves as the training dataset, HAWPv2 outperforms our PLNet. However, on the YorkUrban dataset, it is surpassed by our method. Second, the previous methods are all evaluated with color inputs in their original paper. Considering that grayscale images are also widely used in vSLAM systems, we train our PLNet with grayscale inputs. We also retrain HAWPv2 and evaluate it using grayscale images for comparison. The result shows that our PLNet significantly outperforms HAWPv2 on both datasets when the inputs are grayscale images. We think the better generalizability comes from our backbone. Other methods are trained on only 5,000 images of the Wireframe dataset, while our backbone is trained on a large diverse dataset, which gives it a stronger feature extraction capability.
泛化性分析:我们还可以得出结论,我们的PLNet的泛化性优于其他方法。这一结论基于我们的方法与当前最佳的线框解析方法HAWPv2之间的两个比较结果。首先,在Wireframe数据集(同时也是训练数据集)上,HAWPv2优于我们的PLNet。然而,在YorkUrban数据集上,它被我们的方法超越。其次,先前的方法在其原始论文中都是使用彩色输入进行评估的。考虑到灰度图像在vSLAM系统中也广泛使用,我们使用灰度输入训练我们的PLNet。我们还重新训练HAWPv2,并使用灰度图像进行评估以进行比较。结果显示,当输入为灰度图像时,我们的PLNet在两个数据集上都显著优于HAWPv2。我们认为更好的泛化性来自于我们的骨干网络。其他方法仅在Wireframe数据集的5,000张图像上进行训练,而我们的骨干网络在一个大型多样化的数据集上进行训练,这赋予了它更强的特征提取能力。

Fig. 6. The line detection comparison between our PLNet (a wireframe parsing method) and SOLD2 (a non-wireframe-parsing method). The red lines are detected line features and the green points are endpoints of lines. Our PLNet aims to detect structural lines while SOLD2 detects more general lines with significant gradients, such as the patterns on the floor and walls.
图 6. 我们的 PLNet(一种线框解析方法)与 SOLD2(一种非线框解析方法)之间的线条检测比较。红色线条是检测到的线条特征,绿色点是线条的端点。我们的 PLNet 旨在检测结构线条,而 SOLD2 则检测具有显著梯度的更一般线条,如地板和墙壁上的图案。
TABLE I
The comparison of various wireframe parsing methods. The top two results are highlighted and underlined in order.
各种线框解析方法的比较。前两名结果已加粗并下划线标注。
Methods\({}^{1}\) | Wireframe sAP5 | Wireframe sAP10 | Wireframe sAP15 | YorkUrban sAP5 | YorkUrban sAP10 | YorkUrban sAP15 | FPS |
---|---|---|---|---|---|---|---|
AFM [81] (color) | 18.5 | 24.4 | 27.5 | 7.3 | 9.4 | 11.1 | 10.4* |
AFM++ [82] (color) | 27.7 | 32.4 | 34.8 | 9.5 | 11.6 | 13.2 | 8.0* |
L-CNN [47] (color) | 59.7 | 63.6 | 65.3 | 25.0 | 27.1 | 28.3 | 29.6 |
LETR [83] (color) | 59.2 | 65.2 | 67.7 | 23.9 | 27.6 | 29.7 | 2.0 |
F-Clip [84] (color) | 64.3 | 68.3 | 69.1 | 28.6 | 31.0 | 32.4 | 82.3 |
ELSD [85] (color) | 64.3 | 68.9 | 70.9 | 27.6 | 30.2 | 31.8 | 42.6 |
HAWPv2 [49] (color) | 65.7 | 69.7 | 71.3 | 28.9 | 31.2 | 32.6 | 85.2 |
HAWPv2 [49] (grayscale) | 63.6 | 67.7 | 69.5 | 26.6 | 29.0 | 30.3 | 85.2 |
PLNet (Ours, grayscale) | 65.2 | 69.2 | 70.9 | 29.3 | 32.0 | 33.5 | 79.4† |
\({}^{1}\) Methods marked (color) are evaluated with color inputs and methods marked (grayscale) are evaluated with grayscale inputs.
\({}^{1}\) 标注为 (color) 的方法使用彩色输入进行评估,标注为 (grayscale) 的方法使用灰度输入进行评估。
* These numbers are cited from the original paper.
* 这些数据引自原论文。
† The FPS of our PLNet is the speed of detecting both keypoints and lines.
† 我们的 PLNet 的 FPS 是检测关键点和线条的速度。
Efficiency Analysis: It is worth noting that the FPS of our method in Table I is the speed of detecting both keypoints and lines, while other methods can only output lines. Nevertheless, our PLNet remains one of the fastest methods due to the design of the shared backbone. PLNet processes each image only \({0.86}\mathrm{\;{ms}}\) slower than the fastest algorithm,i.e.,HAWPv2.
效率分析:值得注意的是,表 I 中我们的方法的 FPS 是检测关键点和线条的速度,而其他方法只能输出线条。尽管如此,由于共享主干的架构设计,我们的 PLNet 仍然是速度最快的方法之一。PLNet 处理每张图像的速度仅比最快的算法 HAWPv2 慢 \({0.86}\mathrm{\;{ms}}\)。
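As a quick consistency check of the quoted gap, using the FPS values in Table I:
\[
\frac{1000}{79.4} - \frac{1000}{85.2} \approx {12.59}\mathrm{\;{ms}} - {11.74}\mathrm{\;{ms}} \approx {0.86}\mathrm{\;{ms}}.
\]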
Note that the selected baselines are all wireframe parsing methods. The non-wireframe-parsing line detection methods, such as SOLD2 [48] and DeepLSD [65], are not added to the comparison as it is unfair to do so. As shown in Fig. 6, the wireframe parsing techniques aim to detect structural lines. They are usually evaluated using the sAP and compared with the ground truth. The non-wireframe-parsing methods can detect more general lines with significant gradients, however, they often detect a long line segment as multiple short line segments, which results in their poor sAP performance.
请注意,所选基线均为线框解析方法。非线框解析的线条检测方法,如 SOLD2 [48] 和 DeepLSD [65],未加入比较,因为这样做不公平。如图 6 所示,线框解析技术旨在检测结构线条。它们通常使用 sAP 进行评估并与真实值进行比较。非线框解析方法可以检测具有显著梯度的更一般线条,但它们往往将一条长线段检测为多条短线段,导致其 sAP 性能较差。
B. Mapping Accuracy 映射精度
In this section, we evaluate the mapping accuracy of our system under well-illuminated conditions. The EuRoC dataset [95] is one of the most widely used datasets for vSLAM, so we use it for the accuracy evaluation. We compare our method only with systems capable of estimating the real scale, so the selected baselines are either visual-inertial systems, stereo systems, or those incorporating both. We incorporate traditional methods, learning-based systems, and hybrid systems into the comparison. We use AirVIO to represent our system without loop detection. The root mean square error (RMSE) is used as the metric and computed by the evo [96].
在本节中,我们评估了我们的系统在良好光照条件下的映射精度。EuRoC 数据集 [95] 是 vSLAM 中最广泛使用的数据集之一,因此我们使用它进行精度评估。我们仅与能够估计真实尺度的系统进行比较,因此选定的基线要么是视觉惯性系统,要么是立体系统,或者是两者的结合。我们将传统方法、基于学习的系统和混合系统纳入比较。我们使用 AirVIO 来代表我们的系统,不包括回环检测。均方根误差(RMSE)作为评估指标,并通过 evo [96] 计算。
TABLE II
Translational error (RMSE) on the EuRoC dataset (unit: m). The best results are in bold.
EuRoC 数据集上的平移误差(RMSE)(单位:米),最佳结果以粗体显示。
Loop Closure | Method | ${\mathrm{M}}^{1}$ | ${\mathrm{S}}^{1}$ | ${\mathrm{I}}^{1}$ | ${\mathbf{P}}^{1}$ | ${\mathrm{L}}^{1}$ | MH01 | MH02 | MH03 | MH04 | MH05 | V101 | V102 | V103 | V201 | V202 | V203 | Avg${}^{2}$ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
w/o | VINS-Fusion [3] | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | X | 0.163 | 0.178 | 0.316 | 0.331 | 0.175 | 0.102 | 0.099 | 0.112 | 0.110 | 0.124 | 0.252 | 0.178 |
Struct-VIO [45] | $\checkmark$ | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | 0.119 | 0.100 | 0.283 | 0.275 | 0.256 | 0.075 | 0.197 | 0.161 | 0.081 | 0.152 | 0.177 | 0.171 | |
PLF-VINS [86] | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | 0.143 | 0.178 | 0.221 | 0.240 | 0.260 | 0.069 | 0.099 | 0.166 | 0.083 | 0.125 | 0.183 | 0.161 | |
Kimera-VIO [61] | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | X | 0.110 | 0.100 | 0.160 | 0.240 | 0.350 | 0.050 | 0.080 | 0.070 | 0.080 | 0.100 | 0.210 | 0.141 | |
OKVIS [87] | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | X | 0.197 | 0.108 | 0.122 | 0.138 | 0.272 | 0.040 | 0.067 | 0.120 | 0.055 | 0.150 | 0.240 | 0.137 | |
AirVIO (Ours) | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | 0.074 | 0.060 | 0.114 | 0.167 | 0.125 | 0.033 | 0.132 | 0.238 | 0.036 | 0.083 | 0.168 | 0.113 | |
w/ | iSLAM [18] | X | $\checkmark$ | $\checkmark$ | X | X | 0.302 | 0.460 | 0.363 | 0.936 | 0.478 | 0.355 | 0.391 | 0.301 | 0.452 | 0.416 | 1.133 | 0.508 |
UV-SLAM [88] | $\checkmark$ | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | 0.161 | 0.179 | 0.176 | 0.291 | 0.189 | 0.077 | 0.071 | 0.094 | 0.078 | 0.085 | 0.125 | 0.139 | |
Kimera [89] | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | X | 0.090 | 0.110 | 0.120 | 0.160 | 0.180 | 0.050 | 0.060 | 0.130 | 0.050 | 0.070 | 0.230 | 0.114 | |
OpenVINS [90] | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | X | 0.072 | 0.143 | 0.086 | 0.173 | 0.247 | 0.055 | 0.060 | 0.059 | 0.054 | 0.047 | 0.141 | 0.103 | |
Structure-PLP-SLAM [91] | X | $\checkmark$ | X | $\checkmark$ | $\checkmark$ | 0.046 | 0.056 | 0.048 | 0.071 | 0.071 | 0.091 | 0.066 | 0.065 | 0.061 | 0.061 | 0.166 | 0.073 | |
VINS-Fusion [3] | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | X | 0.052 | 0.040 | 0.052 | 0.124 | 0.088 | 0.046 | 0.053 | 0.108 | 0.040 | 0.081 | 0.098 | 0.071 | |
Maplab [92] | $\checkmark$ | X | $\checkmark$ | $\checkmark$ | X | 0.041 | 0.026 | 0.045 | 0.110 | 0.067 | 0.039 | 0.045 | 0.080 | 0.053 | 0.084 | 0.196 | 0.071 | |
SP-Loop [93] | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | X | 0.070 | 0.044 | 0.068 | 0.100 | 0.090 | 0.042 | 0.034 | 0.082 | 0.038 | 0.054 | 0.100 | 0.066 | |
PL-SLAM [4] | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | 0.042 | 0.052 | 0.040 | 0.064 | 0.070 | 0.042 | 0.046 | 0.069 | 0.061 | 0.057 | 0.126 | 0.061 | |
Basalt [14] | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | X | 0.080 | 0.060 | 0.050 | 0.100 | 0.080 | 0.040 | 0.020 | 0.030 | 0.030 | 0.020 | 0.059 | 0.052 | |
DVI-SLAM [94] | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | X | 0.042 | 0.046 | 0.081 | 0.072 | 0.069 | 0.059 | 0.034 | 0.028 | 0.040 | 0.039 | 0.055 | 0.051 | |
ORB-SLAM3 [60] | $\checkmark$ | X | $\checkmark$ | $\checkmark$ | X | 0.036 | 0.033 | 0.035 | 0.051 | 0.082 | 0.038 | 0.014 | 0.024 | 0.032 | 0.014 | 0.024 | 0.035 | |
AirSLAM (Ours) | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | 0.019 | 0.013 | 0.025 | 0.056 | 0.051 | 0.032 | 0.014 | 0.025 | 0.014 | 0.018 | 0.068 | 0.030 |
\({}^{1}\mathrm{M}\) denotes the monocular camera, \(\mathrm{S}\) denotes the stereo camera,I denotes the IMU,P denotes the keypoint feature,and L denotes the line feature. \({}^{2}\) The average error of the successful sequences.
\({}^{1}\mathrm{M}\) 表示单目相机,\(\mathrm{S}\) 表示立体相机,I 表示 IMU,P 表示关键点特征,L 表示线特征。\({}^{2}\) 成功序列的平均误差。
The comparison results are presented in Table II. We evaluate the systems with and without loop detection on 11 sequences. For the comparison without loop detection, our method outperforms other SOTA VIO methods: we achieve the best results on 8 out of 11 sequences. The average translational error of AirVIO is \({20}\%\) lower than the second-best system, i.e., Kimera-VIO. For the comparison with loop detection, our system achieves comparable performance with ORB-SLAM3 and surpasses other methods. Our AirSLAM achieves the best results on 7 sequences and ORB-SLAM3 achieves the best results on the other 5 sequences, while our average error is a little better than ORB-SLAM3. Another conclusion that can be drawn from Table II is that loop detection significantly improves the accuracy of our system. The average error of our system decreases by \({74}\%\) after the loop detection.
比较结果如表 II 所示。我们在 11 个序列上评估了有无回环检测的系统。在没有回环检测的比较中,我们的方法优于其他 SOTA VIO 方法:我们在 11 个序列中的 8 个上取得了最佳结果。AirVIO 的平均平移误差比第二好的系统 Kimera-VIO 低 \({20}\%\)。在有回环检测的比较中,我们的系统与 ORB-SLAM3 性能相当,并超越了其他方法。我们的 AirSLAM 在 7 个序列上取得了最佳结果,而 ORB-SLAM3 在其他 5 个序列上取得了最佳结果,同时我们的平均误差略优于 ORB-SLAM3。从表 II 中还可以得出的另一个结论是,回环检测显著提高了我们系统的准确性。回环检测后,我们系统的平均误差降低了 \({74}\%\)。
C. Mapping Robustness 地图鲁棒性
Although many vSLAM systems have achieved impressive accuracy as shown in the previous Section VII-B, complex lighting conditions usually render them ineffective when deployed in real applications. Therefore, in this section, we evaluate the robustness of various vSLAM systems to lighting conditions. We select several representative SOTA systems as baselines. They are ORB-SLAM3 [60], an accurate feature-based system, DROID-SLAM [17], a learning-based hybrid system, Basalt [14], a system that achieves illumination-robust optical flow tracking with the LSSD algorithm, Kimera [89], a direct visual-inertial SLAM system, and OKVIS [87], a system proven to be illumination-robust in our previous work [21].
尽管许多 vSLAM 系统如前文第七节-B 部分所示,已经实现了令人印象深刻的准确性,但复杂的照明条件通常会使它们在实际应用中失效。因此,在本节中,我们评估了各种 vSLAM 系统对照明条件的鲁棒性。我们选择了几个具有代表性的 SOTA 系统作为基准。它们是 ORB-SLAM3 [60],一个基于特征的准确系统;DROID-SLAM [17],一个基于学习的混合系统;Basalt [14],一个使用 LSSD 算法实现光照鲁棒光流跟踪的系统;Kimera [89],一个直接视觉惯性 SLAM 系统;以及 OKVIS [87],一个在我们之前的工作 [21] 中被证明具有光照鲁棒性的系统。

Fig. 7. Comparison based on the OIVIO dataset. The vertical axis is the proportion of pose errors that are less than the given alignment error threshold on the horizontal axis. Our AirSLAM achieves the most accurate result.
图 7. 基于 OIVIO 数据集的比较。纵轴表示姿态误差小于横轴上给定的对齐误差阈值的比例。我们的 AirSLAM 取得了最准确的结果。
TABLE III
RMSE (m) on the OIVIO dataset. The best results are in bold. F represents tracking failure or large drift error.
OIVIO 数据集上的 RMSE(M),最佳结果以粗体显示。F 表示跟踪失败或大漂移误差。
Sequence | Kimera | PL-SLAM | Basalt | DROID-SLAM | ORB-SLAM3 | Ours |
---|---|---|---|---|---|---|
MN_015_GV_01 | 0.169 | 1.238 | 0.216 | 0.286 | 0.066 | 0.054 |
MN_015_GV_02 | 2.408 | 0.853 | 0.153 | 0.081 | 0.069 | 0.052 |
MN_050_GV_01 | F | 1.143 | 0.186 | 0.173 | 0.063 | 0.062 |
MN_050_GV_02 | F | 0.921 | 0.103 | 0.080 | 0.053 | 0.048 |
MN_100_GV_01 | F | 0.831 | 0.197 | 0.184 | 0.051 | 0.064 |
MN_100_GV_02 | 2.238 | 0.609 | 0.092 | 0.090 | 0.063 | 0.042 |
TN_015_GV_01 | 0.300 | 1.579 | 0.148 | 0.188 | 0.053 | 0.057 |
TN_050_GV_01 | 0.280 | 1.736 | 0.521 | 0.313 | 0.082 | 0.065 |
TN_100_GV_01 | 0.264 | 1.312 | 0.116 | 0.179 | 0.086 | 0.078 |
Average | $-$ | 1.358 | 0.192 | 0.175 | 0.065 | 0.058 |
We test these methods and our system in three scenarios: onboard illumination, dynamic illumination, and low-lighting environments. We first present the evaluation results in Section VII-C1, Section VII-C2, and Section VII-C3, respectively, and then give an overall analysis in Section VII-C4.
我们在三种场景下测试这些方法和我们的系统:车载照明、动态照明和低光照环境。我们首先分别在第七章节 C1、C2 和 C3 中展示评估结果,然后在第七章节 C4 中给出总体分析。
Onboard Illumination: We utilize the OIVIO dataset [75] to assess the performance of various systems with onboard illumination. The OIVIO dataset collects visual-inertial data in tunnels and mines. In each sequence, the scene is illuminated by an onboard light of approximately 1300,4500 , or 9000 lumens. We used all nine sequences with ground truth acquired by the Leica TCRP1203 R300. As no loop closure exists in the selected sequences, it is fair to compare the VO systems with the SLAM systems. The performance of translational error is presented in Table III. The most accurate results are in bold, and \(\mathrm{F}\) represents that the tracking is lost for more than \({10}\mathrm{\;s}\) or the RMSE exceeds \({10}\mathrm{\;m}\) . It can be seen that our method achieves the most accurate results on 7 out of 9 sequences and the smallest average error. The onboard illumination has almost no impact on our AirSLAM and ORB-SLAM3, however, it reduces the accuracy of OKVIS, Basalt, and PL-SLAM. Kimera suffers a lot from such illumination conditions. It even experiences tracking failures and large drift errors on three sequences.
车载照明:我们利用 OIVIO 数据集 [75] 来评估各种系统在车载照明下的性能。OIVIO 数据集收集了隧道和矿井中的视觉-惯性数据。在每个序列中,场景由大约 1300、4500 或 9000 流明的车载灯光照亮。我们使用了所有九个序列,其地面真实数据由徕卡 TCRP1203 R300 获取。由于所选序列中不存在闭环,因此公平地比较 VO 系统和 SLAM 系统。平移误差的性能在表 III 中展示。最准确的结果以粗体显示,\(\mathrm{F}\) 表示跟踪丢失超过 \({10}\mathrm{\;s}\) 或 RMSE 超过 \({10}\mathrm{\;m}\)。可以看出,我们的方法在 9 个序列中取得了 7 个最准确的结果和最小的平均误差。车载照明对我们的 AirSLAM 和 ORB-SLAM3 几乎没有影响,但它降低了 OKVIS、Basalt 和 PL-SLAM 的准确性。Kimera 在这种照明条件下受到很大影响。它在三个序列中甚至经历了跟踪失败和大漂移误差。

Fig. 8. Our feature detection and matching on a challenging sequence in UMA-VI dataset. The red lines represent detected line features and the colored lines across images indicate feature association. The image may suddenly go dark due to turning off the lights, which is very difficult for vSLAM systems.
图8展示了我们在UMA-VI数据集中的一个具有挑战性的序列上的特征检测和匹配。红色线条代表检测到的线特征,而跨图像的彩色线条表示特征关联。由于灯光突然关闭,图像可能会变暗,这对vSLAM系统来说非常困难。
We show a comparison of our method with selected baselines on the OIVIO TN_100_GV_01 sequence in Fig. 7. In this case, the robot goes through a mine with onboard illumination. The distance is about 150 meters and the average speed is about \({0.84}\mathrm{\;m}/\mathrm{s}\). The plot shows the proportion of pose errors that are less than the given alignment error threshold on the horizontal axis. Our system achieves a more accurate result than other systems on this sequence.
我们在图7中展示了我们的方法与OIVIO TN_100_GV_01序列上选定的基线的比较。在这种情况下,机器人通过一个带有车载照明的矿井。距离约为150米,平均速度约为\({0.84}\mathrm{\;m}/\mathrm{s}\)。图中显示了水平轴上小于给定对齐误差阈值的位姿误差的比例。我们的系统在这个序列上比其他系统取得了更准确的结果。
Dynamic Illumination: The UMA-VI dataset is a visual-inertial dataset gathered in challenging scenarios with handheld custom sensors. We selected sequences with illumination changes to evaluate our system. As shown in Fig. 8, it contains many sub-sequences where the image suddenly goes dark due to turning off the lights. It is more challenging than the OIVIO dataset for vSLAM systems. As the ground-truth poses are only available at the beginning and the end of each sequence, we disabled the loop closure part from all the evaluated methods.
动态光照:UMA-VI数据集是一个在具有挑战性的场景中使用手持自定义传感器收集的视觉-惯性数据集。我们选择了具有光照变化的序列来评估我们的系统。如图8所示,它包含许多子序列,由于灯光关闭,图像会突然变暗。这对vSLAM系统来说比OIVIO数据集更具挑战性。由于地面真实位姿仅在每个序列的开始和结束时可用,我们禁用了所有评估方法中的回环闭合部分。
The translational errors are presented in Table IV. The most accurate results are in bold, and \(\mathrm{F}\) represents that the tracking is lost for more than \({10}\mathrm{\;s}\) or the RMSE exceeds \({10}\mathrm{\;m}\). It can be seen that our AirSLAM outperforms other methods. Our system achieves the best results on 7 out of 10 sequences. The UMA-VI dataset is so challenging that PL-SLAM and ORB-SLAM3 fail on most sequences. Although OKVIS and Basalt, like our system, can complete all the sequences, their accuracy is significantly lower than ours. The average RMSEs of OKVIS and Basalt are around \({1.134}\mathrm{\;m}\) and \({0.724}\mathrm{\;m}\), respectively, while ours is around \({0.441}\mathrm{\;m}\), which means our average error is only \({38.9}\%\) of that of OKVIS and \({62.6}\%\) of that of Basalt.
表 IV 中展示了平移误差。最准确的结果以粗体显示,\(\mathrm{F}\) 表示跟踪丢失超过 \({10}\mathrm{\;s}\) 或均方根误差(RMSE)超过 \({10}\mathrm{\;m}\)。可以看出,我们的 AirSLAM 优于其他方法。我们的系统在 10 个序列中有 7 个取得了最佳结果。UMA-VI 数据集极具挑战性,以至于 PL-SLAM 和 ORB-SLAM3 在大多数序列上失败。尽管 OKVIS 和 Basalt 像我们的系统一样能够完成所有序列,但它们的准确性明显低于我们。OKVIS 和 Basalt 的平均 RMSE 分别约为 \({1.134}\mathrm{\;m}\) 和 \({0.724}\mathrm{\;m}\),而我们的约为 \({0.441}\mathrm{\;m}\),这意味着我们的平均误差仅为 OKVIS 的 \({38.9}\%\) 和 Basalt 的 \({62.6}\%\)。
TABLE IV
RMSE (m) on the UMA-VI dataset. The best results are in bold. F represents tracking failure or large drift error.
UMA-VI 数据集上的 RMSE(米),最佳结果以粗体显示。F 表示跟踪失败或大漂移误差。
Sequence | PL-SLAM | ORB-SLAM3 | Basalt | OKVIS | DROID-SLAM | Ours |
---|---|---|---|---|---|---|
conference-csc1 | 2.697 | F | 1.270 | 1.118 | 0.711 | 0.490 |
conference-csc2 | 1.596 | F | 0.682 | 0.470 | 0.135 | 0.091 |
conference-csc3 | F | 0.426 | 0.469 | 0.088 | 0.724 | 0.088 |
lab-module-csc-rev | F | 0.063 | 0.486 | 0.861 | 0.364 | 0.504 |
lab-module-csc | F | F | 0.403 | 0.579 | 0.319 | 0.979 |
long-walk-eng | F | F | 5.046 | 3.005 | F | 1.801 |
third-floor-csc1 | 4.478 | 0.863 | 0.420 | 0.287 | 0.048 | 0.070 |
third-floor-csc2 | 6.068 | 0.149 | 0.590 | 0.271 | 0.890 | 0.127 |
two-floors-csc1 | F | F | 0.760 | 0.154 | 0.341 | 0.066 |
two-floors-csc2 | F | F | 1.211 | 0.679 | 0.299 | 0.190 |

Fig. 9. We use the gamma nonlinearity to generate image sequences with low illumination. \(i\) is the brightness level. \(A\) and \(\gamma\) are parameters to control the image brightness. The smaller \(A\) and \(\gamma\) are, the darker the image.
图 9. 我们使用伽马非线性来生成低照度的图像序列。\(i\) 是亮度级别。\(A\) 和 \(\gamma\) 是控制图像亮度的参数。\(A\) 和 \(\gamma\) 越小,图像越暗。
Low Illumination: Inspired by [15], we process a publicly available sequence by adjusting the brightness levels of its images. Then the processed sequences are used to evaluate the performance of various SLAM systems in low-illumination conditions. We select the "V2_01_easy" of the EuRoC dataset as the base sequence. The image brightness is adjusted using the gamma nonlinearity:
低照度:受 [15] 启发,我们通过调整其图像的亮度级别来处理一个公开可用的序列。然后使用处理后的序列来评估各种 SLAM 系统在低照度条件下的性能。我们选择 EuRoC 数据集的 "V2_01_easy" 作为基础序列。使用伽马非线性调整图像亮度:
where \({V}_{\text{in }}\) and \({V}_{\text{out }}\) are normalized input and output pixel values, respectively. \(A\) and \(\gamma\) control the maximum brightness and contrast. We set 12 adjustment levels and use \({L}_{i}\) to denote the \(i\)th level. \({L}_{0}\) represents the original sequence, i.e., \({A}_{0} = 1\) and \({\gamma }_{0} = 1\). When \(i \in \left\lbrack {1,{12}}\right\rbrack\), \({A}_{i}\) and \({\gamma }_{i}\) alternate in descending order to make the image progressively darker. Fig. 9 shows the values of \({A}_{i}\) and \({\gamma }_{i}\), and the processed image at each level. We name the processed dataset "Dark EuRoC".
其中 \({V}_{\text{in }}\) 和 \({V}_{\text{out }}\) 分别是归一化的输入和输出像素值,\(A\) 和 \(\gamma\) 控制最大亮度和对比度。我们设置了12个调整级别,并用 \({L}_{i}\) 表示第 \(i\) 级。\({L}_{0}\) 代表原始序列,即 \({A}_{0} = 1\) 和 \({\gamma }_{0} = 1\)。当 \(i \in \left\lbrack {1,{12}}\right\rbrack\) 时,\({A}_{i}\) 和 \({\gamma }_{i}\) 按降序交替,使图像逐渐变暗。图9展示了 \({A}_{i}\) 和 \({\gamma }_{i}\) 的值,以及每个级别的处理图像。我们将处理后的数据集命名为“Dark EuRoC”。
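The gamma mapping referenced above is missing in this copy; with the symbols defined in the surrounding text it takes the standard form (reconstructed from those definitions):
\[
{V}_{\text{out }} = A \cdot {V}_{\text{in }}^{\gamma },\qquad {V}_{\text{in }},{V}_{\text{out }} \in \left\lbrack {0,1}\right\rbrack .
\]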
TABLE V
The relocalization comparison on the TartanAir Day/Night Localization dataset, the best results are in bold.
在TartanAir Day/Night Localization数据集上的重定位比较,最佳结果以粗体显示。
Global Feature\({}^{1}\) + Matching\({}^{1}\) + Local Feature\({}^{1}\) | FPS\({}^{2}\) | P000 | P001 | P002 | P003 | P004 | P005 | P006 | P007 | P008 | P009 | P010 | P011 | Avg (%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NV + NN + SOSNet | 12.4 | 4.3 | 10.6 | 42.5 | 10.5 | 33.9 | 26.1 | 7.2 | 27.2 | 32.7 | 6.5 | 24.9 | 7.7 | 19.5 |
NV + NN + D2-Net | 10.34 | 22.6 | 20.7 | 87.4 | 19.0 | 20.1 | 65.9 | 27.5 | 49.0 | 69.0 | 64.3 | 71.4 | 23.4 | 45.0 | |
NV + NN + R2D2 | 8.14 | 15.7 | 32.3 | 92.4 | 22.7 | 52.7 | 75.5 | 14.4 | 84.9 | 54.8 | 67.7 | 60.7 | 26.0 | 50.0 | |
NV + NN + SP | 30.2 | 69.2 | 32.8 | 88.7 | 17.2 | 53.5 | 72.6 | 31.1 | 83.8 | 72.6 | 69.6 | 89.5 | 34.0 | 59.6 | |
NV + AL + SOSNet | 6.1 | 29.0 | 30.3 | 55.6 | 19.6 | 49.3 | 42.4 | 12.7 | 45.8 | 38.9 | 25.2 | 47.0 | 11.3 | 34.0 | |
NV + LG + SIFT | 10.8 | 45.9 | 31.3 | 83.0 | 58.5 | 52.4 | 64.7 | 18.8 | 78.8 | 57.2 | 43.6 | 79.8 | 33.6 | 53.7 | |
NV + LG + DISK | 12.9 | 22.9 | 35.4 | 97.9 | 62.9 | 60.9 | 84.2 | 33.8 | 89.9 | 90.0 | 85.5 | 74.9 | 33.4 | 64.3 | |
NV + LG + SP | 20.6 | 94.7 | 35.4 | 98.4 | 70.9 | 63.5 | 85.8 | 33.8 | 95.9 | 97.1 | 91.4 | 99.2 | 49.8 | 76.3 | |
DIR + LG + SP | 19.1 | 87.2 | 28.8 | 99.5 | 69.8 | 64.2 | 85.2 | 32.8 | 96.7 | 97.1 | 88.1 | 96.0 | 50.4 | 74.7 | |
OpenIBL + LG + SP | 21.3 | 88.6 | 35.4 | 97.7 | 70.3 | 64.3 | 86.1 | 34.7 | 96.4 | 91.3 | 95.7 | 99.7 | 45.1 | 75.4 | |
EP + LG + SP | 22.3 | 99.7 | 36.4 | 100.0 | 68.3 | 64.5 | 85.8 | 34.4 | 96.5 | 94.0 | 85.1 | 96.9 | 45.1 | 75.6 | |
AirSLAM (Ours) | 48.8 | 89.5 | 78.8 | 87.8 | 94.6 | 88.2 | 85.6 | 70.0 | 72.8 | 83.4 | 78.8 | 81.4 | 54.5 | 80.5 |
\({}^{1}\) NV is NetVLAD,DIR is AP-GeM/DIR,EP is EigenPlace,NN is Nearest Neighbor Matching,AL is AdaLAM,LG is LightGlue,and SP is SuperPoint. \({}^{2}\) Running time of relocalization measured in Frame Per Second (FPS).
\({}^{1}\) NV代表NetVLAD,DIR代表AP-GeM/DIR,EP代表EigenPlace,NN代表最近邻匹配,AL代表AdaLAM,LG代表LightGlue,SP代表SuperPoint。\({}^{2}\) 重定位的运行时间以每秒帧数(FPS)衡量。

Fig. 10. The comparison results on the Dark EuRoC dataset. The higher the level of low illumination on the x-axis, the darker the image. Basalt is the most stable system while our system is more accurate.
图10展示了Dark EuRoC数据集上的比较结果。x轴上的低光照级别越高,图像越暗。Basalt系统最为稳定,而我们的系统更为准确。
We present the comparison result in Fig. 10. As the errors of PL-SLAM are much greater than other methods, we do not show its result. Tracking failures and large drift errors, i.e., the RMSE is more than \(1\mathrm{\;m}\) ,are also marked. It can be seen that low illumination has varying degrees of impact on different systems. Our system and Basalt achieve the best outcome. Basalt is more stable for low illumination: its RMSE remains almost unchanged under different brightness levels. Our system is more accurate: AirSLAM has the smallest error on most sequences. The RMSEs of ORB-SLAM3 and OKVIS increase as the brightness decreases. They even experience tracking failures or large drift errors on \({L}_{10}\) and \({L}_{11}\) .
我们在图10中展示了比较结果。由于PL-SLAM的误差远大于其他方法,我们没有展示其结果。跟踪失败和大漂移误差,即均方根误差(RMSE)超过\(1\mathrm{\;m}\),也被标记出来。可以看出,低光照对不同系统的影响程度不同。我们的系统和Basalt取得了最佳结果。Basalt在低光照下更为稳定:其在不同亮度水平下的RMSE几乎保持不变。我们的系统更为精确:AirSLAM在大多数序列中具有最小的误差。随着亮度降低,ORB-SLAM3和OKVIS的RMSE增加。它们甚至在\({L}_{10}\)和\({L}_{11}\)上经历了跟踪失败或大漂移误差。
Result Analysis: We think the above three lighting conditions affect a visual system in different ways. The OIVIO dataset collects sequences in dark environments with only onboard illumination, so the light source moves along with the robot, which results in two effects. On the one hand, the lighting is uneven in the environment. The direction the robot is facing and the area closer to the robot is brighter than other areas. The uneven image brightness may lead to the uneven distribution of features. On the other hand, when the robot moves, the lighting of the same area will change, resulting in different brightness in different frames. The assumption of brightness constancy in some systems will be affected in such conditions. The UMA-VI dataset is collected under dynamic lighting conditions, where the dynamic lighting is caused by the sudden switching of lights or moving between indoor and outdoor environments. The image brightness variations in the UMA-VI dataset are much more intense than those in the OIVIO dataset, which may even make the extracted feature descriptor inconsistent in consecutive frames. In low-illumination environments, both the brightness and contrast of captured images are very low, making the vSLAM system more difficult to detect enough good features and extract distinct descriptors.
结果分析:我们认为上述三种光照条件以不同方式影响视觉系统。OIVIO数据集收集了仅依靠车载照明的黑暗环境中的序列,因此光源随机器人移动,这导致了两种效应。一方面,环境中的光照不均匀。机器人所面对的方向和靠近机器人的区域比其他区域更亮。图像亮度的不均匀可能导致特征分布的不均匀。另一方面,当机器人移动时,同一区域的光照会发生变化,导致不同帧中的亮度不同。在这种条件下,某些系统中亮度恒定的假设将受到影响。UMA-VI数据集是在动态光照条件下收集的,动态光照是由灯光突然切换或室内外环境切换引起的。UMA-VI数据集中的图像亮度变化比OIVIO数据集更为剧烈,这甚至可能导致连续帧中提取的特征描述符不一致。在低光照环境中,捕获图像的亮度和对比度都非常低,使得vSLAM系统更难以检测到足够好的特征并提取出清晰的描述符。

Fig. 11. Some image pair samples for the mapping and relocalization in the TartanAir Day/Night Localization dataset. Due to the differences in capture viewpoints and scene depths, not all the image pairs have a valid overlap.
图11. TartanAir日/夜定位数据集中用于地图构建和重定位的一些图像对样本。由于捕获视角和场景深度的差异,并非所有图像对都具有有效的重叠区域。
We summarize the above experiment results with the following conclusions. First, the systems that use descriptors for matching are more robust than the direct methods in illumination-dynamic environments. On the OIVIO dataset, our AirSLAM and ORB-SLAM3 outperform the other systems significantly. On the UMA-VI dataset, our method and OKVIS achieve the best and the second-best results, respectively. This is reasonable as the brightness constancy assumption constrains the direct methods. Although Basalt uses LSSD to enhance its optical flow tracking, its accuracy still decreases significantly in these two scenarios. Second, the direct methods are more stable in the low illumination environments. This is because descriptor-based SLAM systems rely on enough high-quality features and descriptors, which are difficult to obtain on low brightness and contrast images. The direct methods use corners that are easier to detect, so the low illumination has less impact on them. Third, thanks to the robust feature detection and matching, the illumination robustness of our system is far better than that of other systems. AirSLAM achieves relatively high accuracy in these three illumination-challenging scenarios.
我们总结上述实验结果,得出以下结论。首先,在光照动态环境中,使用描述符进行匹配的系统比直接方法更为稳健。在OIVIO数据集上,我们的AirSLAM和ORB-SLAM3显著优于其他系统。在UMA-VI数据集上,我们的方法和OKVIS分别取得了最佳和次佳的结果。这是合理的,因为亮度恒定假设限制了直接方法。尽管Basalt使用LSSD来增强其光流跟踪,但在这两种场景下其准确性仍显著下降。其次,直接方法在低光照环境中更为稳定。这是因为基于描述符的SLAM系统依赖于足够的高质量特征和描述符,而在低亮度和对比度图像上难以获得这些特征。直接方法使用更容易检测的角点,因此低光照对其影响较小。第三,得益于鲁棒的特征检测和匹配,我们的系统在光照鲁棒性方面远超其他系统。AirSLAM在这三种光照挑战性场景中实现了相对较高的准确性。
D. Map Reuse 地图重用
Dataset: As mapping and relocalization in the same well-illuminated environment are no longer difficult for many current vSLAM systems, we only evaluate our map reuse module under illumination-challenging conditions, i.e., the day/night localization task. We use the "abandoned_factory" and "abandoned_factory_night" scenes in the TartanAir dataset [58] as they can provide consecutive stereo image sequences for the SLAM mapping and the corresponding accurate ground truth for the evaluation. The images in these two scenes are collected during the day and at night, respectively. We use the sequences in the "abandoned_factory" scene to build maps. Then, for each mapping image, the images with a relative distance of less than \(3\mathrm{\;m}\) and a relative angle of less than \({15}^{ \circ }\) from it in the "abandoned_factory_night" scene are selected as query images. We call the generated mapping and relocalization sequences the "TartanAir Day/Night Localization" dataset. Fig. 11 shows some sample pairs for mapping and relocalization. It is worth noting that due to the differences in capture viewpoints and scene depths, the query image selected based on the relative distance and angle may not always have valid overlapping with the mapping images.
数据集:由于在相同良好光照环境下的地图构建和重定位对于许多当前的vSLAM系统来说已不再困难,我们仅在光照挑战条件下评估我们的地图复用模块,即日/夜定位任务。我们使用TartanAir数据集[58]中的“废弃工厂”和“废弃工厂_夜晚”场景,因为它们可以提供连续的立体图像序列用于SLAM地图构建以及相应的准确地面实况用于评估。这两个场景中的图像分别在白天和夜间收集。我们使用“废弃工厂”场景中的序列来构建地图。然后,对于每个地图图像,在“废弃工厂_夜晚”场景中选择与其相对距离小于\(3\mathrm{\;m}\)且相对角度小于\({15}^{ \circ }\)的图像作为查询图像。我们将生成的地图构建和重定位序列称为“TartanAir日/夜定位”数据集。图11展示了一些用于地图构建和重定位的样本对。值得注意的是,由于捕获视角和场景深度的差异,基于相对距离和角度选择的查询图像可能并不总是与地图图像有有效的重叠。
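The 3 m / 15° selection rule above is easy to state in code. The sketch below is illustrative only (Eigen types, thresholds taken from the text); it is not the actual dataset-generation script.

```cpp
// Illustrative sketch: is a night image a valid query for a given mapping pose?
#include <Eigen/Geometry>
#include <algorithm>
#include <cmath>

bool IsValidQuery(const Eigen::Isometry3d& T_map_frame,    // daytime mapping pose
                  const Eigen::Isometry3d& T_query_frame)  // nighttime query pose
{
  const Eigen::Isometry3d T_rel = T_map_frame.inverse() * T_query_frame;
  const double dist = T_rel.translation().norm();
  // Rotation angle recovered from the trace of the relative rotation matrix.
  const double cos_angle =
      std::min(1.0, std::max(-1.0, (T_rel.rotation().trace() - 1.0) * 0.5));
  const double angle_deg = std::acos(cos_angle) * 180.0 / M_PI;
  return dist < 3.0 && angle_deg < 15.0;  // thresholds from the text: 3 m, 15 deg
}
```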
Baseline: We have tried several traditional vSLAM systems, e.g., ORB-SLAM3 [60], and SOTA learning-based one-stage relocalization methods, e.g., ACE [97] on the TartanAir Day/Night Localization dataset, and find they perform badly: their relocalization recall rates are below \(1\%\) . Therefore, we only present the comparison results of our systems and some VPR methods. The Hloc toolbox [8] uses the structure from motion (SFM) method to build maps and has integrated many image retrieval methods, local feature extractors, and matching methods for localization. We mainly compare our system with these methods. Specifically, the NetVLAD [62], AP-GeM/DIR [98], OpenIBL [99], and EigenPlaces [100] are used to extract global features, the SuperPoint [23], SIFT [101], D2-Net [39], SOSNet [102], R2D2 [37], and DISK [38] are used to extract local features, and the LightGlue [26], AdaLAM [103] and Nearest Neighbor Matching are used to match features. We combine these methods into various "global feature + matching + local feature" mapping and relocalization pipelines.
基线:我们尝试了多种传统的vSLAM系统,例如ORB-SLAM3 [60],以及基于学习的单阶段重定位方法,例如ACE [97],在TartanAir日/夜定位数据集上,发现它们表现不佳:它们的重定位召回率低于 \(1\%\)。因此,我们仅展示我们的系统与一些VPR方法的比较结果。Hloc工具箱 [8] 使用运动结构(SFM)方法构建地图,并集成了许多图像检索方法、局部特征提取器和匹配方法用于定位。我们主要将我们的系统与这些方法进行比较。具体来说,NetVLAD [62]、AP-GeM/DIR [98]、OpenIBL [99] 和 EigenPlaces [100] 用于提取全局特征,SuperPoint [23]、SIFT [101]、D2-Net [39]、SOSNet [102]、R2D2 [37] 和 DISK [38] 用于提取局部特征,LightGlue [26]、AdaLAM [103] 和最近邻匹配用于特征匹配。我们将这些方法组合成各种“全局特征 + 匹配 + 局部特征”的地图构建和重定位流程。
Results: To achieve a fair comparison and balance the efficiency and effectiveness, we extract 400 local features and retrieve 3 candidates in the coarse localization stage for all methods. Unlike vSLAM systems that have the keyframe selection mechanism, the SFM mapping optimizes all input images, so it is very slow when mapping with original sequences. Therefore, to accelerate the SFM mapping while ensuring its mapping frames are more than our keyframes, we sample its mapping sequences by selecting one frame every four frames. We show a point-line map built by our AirSLAM in Fig. 12. The relocalization results are presented in Table V. We give the running time (FPS) and the relocalization recall rate of each method. We define a successful relocalization if the estimated pose of the query frame is within \(2\mathrm{\;m}\) and \({15}^{ \circ }\) of the ground truth. It can be seen that our AirSLAM outperforms other methods in terms of both efficiency and recall rate. Our system achieves the best results on 5 out of 11 sequences. AirSLAM has an average recall rate \({4.2}\%\) higher than the second-best algorithm and is about 2.4 times faster than it.
结果:为了实现公平比较并平衡效率和效果,我们提取了400个局部特征,并在粗定位阶段为所有方法检索3个候选对象。与具有关键帧选择机制的vSLAM系统不同,SFM建图优化所有输入图像,因此在使用原始序列进行建图时非常缓慢。因此,为了加速SFM建图同时确保其建图帧数多于我们的关键帧,我们通过每隔四帧选择一帧来采样其建图序列。我们在图12中展示了由我们的AirSLAM构建的点线地图。重定位结果在表V中呈现。我们给出了每种方法的运行时间(FPS)和重定位召回率。如果查询帧的估计姿态在真实值的\(2\mathrm{\;m}\)和\({15}^{ \circ }\)范围内,我们定义为成功重定位。可以看出,我们的AirSLAM在效率和召回率方面均优于其他方法。我们的系统在11个序列中的5个上取得了最佳结果。AirSLAM的平均召回率\({4.2}\%\)高于第二好的算法,并且速度大约是其2.4倍。

Fig. 12. A point-line map of the P000 sequence built by our AirSLAM. The red points are mappoints and the blue lines are 3D lines.
图12. 由我们的AirSLAM构建的P000序列的点线地图。红色点是地图点,蓝色线是3D线。
TABLE VI
Ablation study. The recall rates (%) of our system with and without the structure graph in the map reuse module.
消融研究。我们的系统在有和没有地图重用模块中的结构图时的召回率(%)。
Seq. | ${N}_{C} = 3$ (w/o G.) | ${N}_{C} = 3$ (Ours) | ${N}_{C} = 5$ (w/o G.) | ${N}_{C} = 5$ (Ours) | ${N}_{C} = {10}$ (w/o G.) | ${N}_{C} = {10}$ (Ours) |
---|---|---|---|---|---|---|
P000 | 77.9 | 89.5 | 84.3 | 92.5 | 88.5 | 94.0 |
P001 | 69.2 | 78.8 | 79.3 | 83.8 | 85.4 | 87.9 |
P002 | 75.4 | 87.8 | 80.9 | 87.8 | 85.3 | 88.0 |
P003 | 86.4 | 94.6 | 92.4 | 95.5 | 93.9 | 95.9 |
P004 | 81.5 | 88.2 | 85.3 | 90.9 | 89.0 | 92.1 |
P005 | 78.4 | 85.6 | 82.9 | 88.8 | 88.4 | 91.5 |
P006 | 62.4 | 70.0 | 72.3 | 72.7 | 78.6 | 77.2 |
P007 | 60.6 | 72.8 | 63.7 | 73.7 | 69.0 | 76.2 |
P008 | 77.2 | 83.4 | 80.0 | 84.4 | 81.7 | 85.8 |
P009 | 63.4 | 78.8 | 70.6 | 81.7 | 76.9 | 84.8 |
P010 | 70.7 | 81.4 | 75.4 | 84.0 | 80.9 | 86.9 |
P011 | 48.1 | 54.5 | 55.1 | 58.9 | 62.0 | 64.0 |
Avg. | 70.9 | 80.5 | 76.9 | 82.9 | 81.6 | 85.4 |
Analysis: We find that our system is more stable than the VPR methods on the TartanAir Day/Night Localization dataset. On several sequences, e.g., P000, P002, and P010, some VPR methods achieve remarkable results, with recall rates close to \({100}\%\) . However,on some other sequences,e.g.,P001 and P006, their recall rates are less than \({40}\%\) . In contrast,our system maintains a recall rate of \({70}\%\) to \({90}\%\) on most sequences.
分析:我们发现,我们的系统在TartanAir昼夜定位数据集上比VPR方法更稳定。在几个序列中,例如P000、P002和P010,一些VPR方法取得了显著的结果,召回率接近\({100}\%\)。然而,在其他一些序列中,例如P001和P006,它们的召回率低于\({40}\%\)。相比之下,我们的系统在大多数序列上保持了\({70}\%\)到\({90}\%\)的召回率。
To clarify this, we examined each sequence and roughly categorized the images into three types. As shown in Fig. 11, the first type of image is captured with the camera relatively far from the features, so there is a significant overlap between each day/night image pair. Additionally, these images contain distinct buildings and landmarks. The second type of image pair also has a significant overlap. However, the common regions in these image pairs do not contain large buildings and landmarks. The third type of image is captured with the camera very close to the features. Although the camera distance between the day/night image pair is not large, they have almost no overlap, making their local feature matching impossible.
为了澄清这一点,我们检查了每个序列并将图像大致分为三种类型。如图11所示,第一种图像是在相机相对远离特征的情况下拍摄的,因此每个昼夜图像对之间有显著的重叠。此外,这些图像包含独特的建筑物和地标。第二种图像对也有显著的重叠。然而,这些图像对中的共同区域不包含大型建筑物和地标。第三种图像是在相机非常接近特征的情况下拍摄的。尽管昼夜图像对之间的相机距离不大,但它们几乎没有重叠,使得局部特征匹配变得不可能。

Fig. 13. For odometry efficiency, we compare with the traditional methods and disable the loop detection from all the methods. For mapping efficiency, we compute the total mapping time and compare it with Hloc.
图13。为了测距效率,我们与传统方法进行了比较,并禁用了所有方法中的回环检测。为了建图效率,我们计算了总建图时间并与Hloc进行了比较。
We find that the VPR methods perform very well on the first type of image but perform poorly on the second type of image. Therefore, their recall rates are very low on P001 and P006, which contain more of the second type of images. This may be because their global features are usually trained on datasets that have a lot of distinct buildings and landmarks, which makes them rely more on such semantic cues to retrieve similar images. By contrast, our system is based on the DBoW method, which only utilizes the low-level local features of images, so it achieves similar performance on the first and second types of images. This also proves the good generalization ability of our system. However, neither our system nor VPR methods can process the pairs with little overlap due to relying on local feature matching. Such image pairs are abundant in the P011.
我们发现,VPR方法在第一类图像上表现非常好,但在第二类图像上表现不佳。因此,它们在P001和P006上的召回率非常低,这两个数据集包含更多第二类图像。这可能是因为它们的全局特征通常在具有大量独特建筑和地标的训练集上进行训练,这使得它们更依赖于这种语义线索来检索相似图像。相比之下,我们的系统基于DBoW方法,该方法仅利用图像的低级局部特征,因此在第一类和第二类图像上实现了相似的性能。这也证明了我们系统良好的泛化能力。然而,无论是我们的系统还是VPR方法,都无法处理重叠较少的图像对,因为它们依赖于局部特征匹配。这种图像对在P011中非常丰富。
E. Ablation Study 消融研究
In this part, we verify the effectiveness of the relocalization method. This experiment is conducted on the TartanAir Day/Night Localization dataset. We compare the systems with and without the second step proposed in Section VI-B2. The results are presented in Table VI, where w/o G. denotes our system without the structure graph, and \({N}_{C}\) denotes the candidate number for local feature matching. It shows that using junctions, line features, and structure graphs to filter out relocalization candidates significantly improves recall rates. AirSLAM outperforms w/o G. across all sequences, and when \({N}_{C}\) is 3, 5, and 10, the average improvements are \({9.6}\%\), \({6.0}\%\), and \({3.8}\%\), respectively, which demonstrates the effective performance of the proposed method.
在这一部分,我们验证了重定位方法的有效性。该实验在TartanAir Day/Night Localization数据集上进行。我们比较了有无第VI-B2节所提出第二步的系统。结果如表VI所示,其中w/o G.表示不使用结构图的系统,\({N}_{C}\)表示用于局部特征匹配的候选帧数量。结果显示,利用交叉点、线特征和结构图来筛选重定位候选帧可以显著提高召回率。AirSLAM在所有序列上都优于w/o G.;当\({N}_{C}\)为3、5和10时,平均提升分别为\({9.6}\%\)、\({6.0}\%\)和\({3.8}\%\),这证明了所提方法的有效性。
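A hedged sketch of the two-stage candidate selection that this ablation isolates is given below. The helper `structure_consistency`, its graph-overlap score, and the `min_struct` threshold are hypothetical placeholders; the actual junction/line/structure-graph test of Section VI-B2 is not reproduced here. Only the top \({N}_{C}\) surviving candidates would then go to local feature matching.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    frame_id: int
    appearance_score: float          # e.g. DBoW similarity to the query
    structure_score: float = 0.0     # filled in by the second stage

def structure_consistency(query, cand) -> float:
    """Hypothetical score in [0, 1] comparing the structure graph of the query
    with that of a map keyframe (placeholder for the paper's actual test)."""
    shared = len(query["graph_edges"] & cand["graph_edges"])
    total = max(len(query["graph_edges"]), 1)
    return shared / total

def select_candidates(query, keyframes, retrieved, n_c=3, min_struct=0.2):
    """Two-stage selection: appearance retrieval, then structure filtering."""
    scored = []
    for c in retrieved:                              # stage 1 output
        c.structure_score = structure_consistency(query, keyframes[c.frame_id])
        if c.structure_score >= min_struct:          # stage 2: drop inconsistent frames
            scored.append(c)
    scored.sort(key=lambda c: (c.structure_score, c.appearance_score), reverse=True)
    return scored[:n_c]                              # only these go to local matching

# --- toy usage ---
keyframes = {
    0: {"graph_edges": {("j1", "j2"), ("j2", "j3")}},
    1: {"graph_edges": {("j7", "j8")}},
}
query = {"graph_edges": {("j1", "j2"), ("j2", "j3"), ("j3", "j4")}}
retrieved = [Candidate(0, 0.61), Candidate(1, 0.74)]
print([c.frame_id for c in select_candidates(query, keyframes, retrieved, n_c=1)])
```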
F. Efficiency Analysis 效率分析
Efficiency is essential for robotics applications, so we also evaluate the efficiency of the proposed system. We first compare the running time of our AirSLAM with several SOTA VO and SLAM systems on a computer with an Intel i9-13900 CPU and an NVIDIA RTX 4070 GPU. Then we deploy AirSLAM on an NVIDIA Jetson AGX Orin to verify the efficiency and performance of our system on the embedded platform.
效率对于机器人应用至关重要,因此我们同样评估了所提出系统的效率。我们首先在配备Intel i9-13900 CPU和NVIDIA RTX 4070 GPU的计算机上,将我们的AirSLAM与多个SOTA视觉里程计(VO)和同时定位与地图构建(SLAM)系统的运行时间进行比较。然后,我们将AirSLAM部署在NVIDIA Jetson AGX Orin上,以验证我们的系统在嵌入式平台上的效率和性能。

Fig. 14. Accuracy comparison of our system on an NVIDIA Jetson Orin and a PC. The vertical axis represents ATE in meters.
图14. 我们的系统在NVIDIA Jetson Orin和PC上的精度比较。纵轴表示绝对轨迹误差(ATE),单位为米。
TABLE VII
EFFICIENCY COMPARISON OF OUR SYSTEM ON TWO PLATFORMS.
我们的系统在两个平台上的效率比较。
| Platform | VIO (FPS) | Optim.\({}^{1}\) (s) | CPU Usage (%) | GPU Usage (MB) |
| --- | --- | --- | --- | --- |
| Ours-Jetson | 40.3 | 57.8 | 224.7 | 989 |
| Ours-PC | 73.1 | 55.5 | 217.8 | 3076 |
\({}^{1}\) The runtime of the offline map optimization.
\({}^{1}\) 离线地图优化的运行时间。
-
Odometry Efficiency: The VO/VIO efficiency experiment is conducted on the MH_01_easy sequence of the EuRoC dataset. We compare our AirSLAM with several SOTA systems. For a fair comparison, loop detection and GBA are disabled in all systems. The metrics are the runtime per frame and the CPU usage. The results are presented in Fig. 13a, where \({100}\%\) CPU usage means one CPU core is fully utilized. It should be noted that DROID-SLAM actually uses 32 CPU cores; its CPU usage is plotted this way in Fig. 13a only for a compact presentation. Our system is the fastest among these systems, achieving a rate of 73 FPS. In addition, because features are extracted and matched on the GPU, our system requires relatively few CPU resources. We also measure GPU usage: DROID-SLAM requires about 8 GB of GPU memory, while our AirSLAM only requires around 3 GB.
-
里程计效率:视觉里程计(VO)/视觉惯性里程计(VIO)效率实验在EuRoC数据集的MH_01_easy序列上进行。我们将AirSLAM与多个SOTA系统进行了比较。为了公平比较,所有系统的回环检测和全局捆绑调整(GBA)都被禁用。评估指标是每帧的运行时间和CPU使用率。结果如图13a所示,其中\({100}\%\)的CPU使用率表示占用1个CPU核心。需要注意的是,DROID-SLAM实际上使用了32个CPU核心,图13a中显示的CPU使用率仅是为了紧凑展示。在这些系统中,我们的系统是最快的,达到了73 FPS。此外,由于特征提取和匹配在GPU上进行,我们的系统所需的CPU资源相对较少。我们还测试了GPU使用情况:DROID-SLAM需要大约8GB的GPU内存,而我们的AirSLAM仅需要约3GB。
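For readers who want to reproduce this kind of measurement, the sketch below is a generic per-frame benchmarking harness, not AirSLAM's own tooling; `process_frame` is a hypothetical stand-in for the VO/VIO front end, and CPU usage follows the same convention as above (100% equals one fully used core).

```python
import os
import time

def process_frame(frame):
    """Stand-in for the VO/VIO front end; replace with the real per-frame call."""
    time.sleep(0.01)

def benchmark(frames):
    """Report average FPS and CPU usage (100% == one fully busy core)."""
    t_cpu0, t_wall0 = os.times(), time.perf_counter()
    for f in frames:
        process_frame(f)
    t_cpu1, t_wall1 = os.times(), time.perf_counter()

    wall = t_wall1 - t_wall0
    cpu = (t_cpu1.user - t_cpu0.user) + (t_cpu1.system - t_cpu0.system)
    fps = len(frames) / wall
    cpu_percent = 100.0 * cpu / wall
    print(f"{fps:.1f} FPS, {cpu_percent:.1f}% CPU over {len(frames)} frames")

benchmark(frames=range(200))
```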
-
Mapping Efficiency: We also evaluate the mapping time, i.e., the total runtime for building the initial map and optimizing it offline. Since we compare our system with Hloc on the TartanAir dataset in the map reuse experiment, we use the same baseline and dataset in this experiment. The average mapping time per frame may differ as the map size varies; therefore, we measure the mapping time with different numbers of input images. The results are presented in Fig. 13b, where \(n \times\) means our system is \(n\) times faster than Hloc. It can be seen that our system is much more efficient than Hloc, especially as the number of input images increases. Besides, Hloc can only use monocular images to build a map without the real scale, and its map contains only point features, while our system can build a point-line map and estimate the real scale using a stereo camera and an IMU. Therefore, our system is more stable and practical for robotics applications than Hloc.
-
建图效率:我们还评估了建图时间,即构建初始地图和离线优化地图的总运行时间。由于我们在地图复用实验中已使用TartanAir数据集将系统与Hloc进行比较,因此本实验采用相同的基线和数据集。当地图规模不同时,每帧的平均建图时间可能会有所不同,因此我们测量了不同输入图像数量下的建图时间。结果如图13b所示,其中\(n \times\)表示我们的系统比Hloc快\(n\)倍。可以看出,我们的系统比Hloc高效得多,尤其是在输入图像数量增加时。此外,Hloc只能使用单目图像构建不具备真实尺度的地图,且地图仅包含点特征,而我们的系统可以使用立体相机和IMU构建点线地图并估计真实尺度。因此,与Hloc相比,我们的系统更稳定,也更适用于机器人应用。
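The \(n \times\) labels in Fig. 13b are simply ratios of total mapping time at each map size; a minimal sketch of that bookkeeping follows, with placeholder timings rather than the measured values.

```python
# Total mapping time (initial mapping + offline optimization) per number of
# input images, in seconds. Placeholder numbers, not the values from the paper.
ours = {1000: 35.0, 2000: 66.0, 4000: 120.0}
hloc = {1000: 300.0, 2000: 900.0, 4000: 2600.0}

for n in sorted(ours):
    speedup = hloc[n] / ours[n]   # the "n x" factor plotted per map size
    print(f"{n} images: ours {ours[n]:.0f}s vs Hloc {hloc[n]:.0f}s -> {speedup:.1f}x")
```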
-
Embedded Platform: We use 8 sequences of the EuRoC dataset to evaluate the efficiency of AirSLAM on the embedded platform. The suffixes "-Jetson" and "-PC" are added to distinguish the results on the two platforms. On the Jetson, we modify three parameters of our system to improve efficiency. First, we reduce the number of detected keypoints from 350 to 300. Second, we change two parameters in Section V-D to make keyframes sparser, i.e., \({\alpha }_{1}\) and \({\alpha }_{2}\) are changed from 0.65 and 0.1 to 0.5 and 0.2, respectively. The other parameters are the same on the two platforms. The comparisons of efficiency and absolute trajectory error (ATE) are presented in Table VII and Fig. 14, respectively. Our AirSLAM runs at a rate of \({40}\mathrm{\;{Hz}}\) on the Jetson while consuming only 2 CPU cores and 989 MB of GPU memory. We find that the runtime of the offline map optimization is very close on the two platforms. This is because AirSLAM-Jetson selects fewer keyframes than AirSLAM-PC, so the loop closure and GBA are faster.
-
嵌入式平台:我们使用 EuRoC 数据集中的 8 个序列来评估 AirSLAM 在嵌入式平台上的效率。后缀 "-Jetson" 和 "-PC" 用于区分不同平台上的结果。在 Jetson 上,我们修改了系统中的三个参数以提高效率。首先,我们将检测到的关键点数量从 350 减少到 300。其次,我们更改了第 V-D 节中的两个参数以使关键帧更稀疏,即 \({\alpha }_{1}\) 和 \({\alpha }_{2}\) 分别从 0.65 和 0.1 更改为 0.5 和 0.2。其他参数在这两个平台上相同。效率和绝对轨迹误差 (ATE) 的比较分别在表 VII 和图 14 中展示。我们的 AirSLAM 可以在 Jetson 上以 \({40}\mathrm{\;{Hz}}\) 的速率运行,同时仅占用 2 个 CPU 核心和 989MB GPU 内存。我们发现离线地图优化的运行时间在这两个平台上非常接近。这是因为 AirSLAM-Jetson 选择的关键帧比 AirSLAM-PC 少,因此回环闭合和 GBA 更快。
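The Jetson deployment described above boils down to three parameter changes. The sketch below shows one hedged way to keep them as per-platform profiles; the key names (`max_keypoints`, `kf_alpha1`, `kf_alpha2`) are illustrative and do not reflect AirSLAM's actual configuration schema.

```python
# Per-platform parameter profiles; only the three values discussed in the text
# differ between the PC and the Jetson. Key names are hypothetical.
PROFILES = {
    "pc":     {"max_keypoints": 350, "kf_alpha1": 0.65, "kf_alpha2": 0.10},
    "jetson": {"max_keypoints": 300, "kf_alpha1": 0.50, "kf_alpha2": 0.20},
}

def load_profile(platform: str) -> dict:
    """Return the tuning used on the given platform; all other settings are shared."""
    return dict(PROFILES[platform])

print(load_profile("jetson"))
```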
VIII. 结论
In this work, we present an efficient and illumination-robust hybrid vSLAM system. To be robust to challenging illumination, the proposed system employs a CNN to detect both keypoints and structural lines. These two types of features are then associated and tracked using a GNN. To make the system more efficient, we propose PLNet, which is the first unified model to detect both point and line features simultaneously. Furthermore, a multi-stage relocalization method based on both appearance and geometry information is proposed for efficient map reuse. We design the system with an architecture that includes online mapping, offline optimization, and online relocalization, making it easier to deploy on real robots. Extensive experiments show that the proposed system outperforms other SOTA vSLAM systems in terms of accuracy, efficiency, and robustness in illumination-challenging environments.
在本工作中,我们提出了一种高效且对光照鲁棒的混合vSLAM系统。为了对挑战性光照条件具有鲁棒性,所提出的系统采用CNN来检测关键点和结构线。然后,这两种特征通过GNN进行关联和跟踪。为了提高系统的效率,我们提出了PLNet,这是首个能够同时检测点和线特征的统一模型。此外,我们还提出了一种基于外观和几何信息的多阶段重定位方法,以实现高效的地图复用。我们设计了包含在线建图、离线优化和在线重定位的系统架构,使其更易于部署在真实机器人上。大量实验表明,所提出的系统在准确性、效率和对光照挑战环境的鲁棒性方面优于其他SOTA vSLAM系统。
Despite its remarkable performance, the proposed system still has limitations. Like other point-line-based SLAM systems, our AirSLAM relies on a sufficient number of line features, so it is best suited to man-made environments; this is because the system was originally designed for warehouse robots. In unstructured environments, it degrades into a point-only system.
尽管性能出色,所提出的系统仍存在局限性。与其他基于点线的SLAM系统类似,我们的AirSLAM依赖于足够多的线特征,因此最适合应用于人造环境;这是因为该系统最初是为仓库机器人设计的。在非结构化环境中,它将退化为仅使用点特征的系统。
Li, "Self-supervising fine-grained region similarities for large-scale image localization," in Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IV 16. Springer, 2020, pp. 369-386. [100] G. Berton, G. Trivigno, B. Caputo, and C. Masone, "Eigenplaces: Training viewpoint robust models for visual place recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11080-11090. [101] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International journal of computer vision, vol. 60, pp. 91-110, 2004. [102] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas, "Sosnet: Second order similarity regularization for local descriptor learning," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11016-11025. [103] L. Cavalli, V. Larsson, M. R. Oswald, T. Sattler, and M. Pollefeys, "Adalam: Revisiting handcrafted outlier detection," arXiv preprint arXiv:2006.04250, 2020. Kuan Xu received the B.E. and M.E. degrees in Electrical Engineering from the Harbin Institute of Technology, Harbin, China, in 2016 and 2018, respectively. From July 2018 to March 2020, he worked as a robot algorithm engineer at Tencent Holdings Ltd, Beijing, China. From March 2020 to April 2022, he served as a senior robot algorithm engineer at Geekplus Technology Co., Ltd., Beijing, China. He is currently a Ph.D. student at the Department of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. His research interests include visual SLAM, robot localization, and perception. 徐宽于2016年和2018年分别获得哈尔滨工业大学电气工程学士和硕士学位。自2018年7月至2020年3月,他在中国北京腾讯控股有限公司担任机器人算法工程师。自2020年3月至2022年4月,他在中国北京极智嘉科技有限公司担任高级机器人算法工程师。他目前是新加坡南洋理工大学电气与电子工程系的博士生。他的研究兴趣包括视觉SLAM、机器人定位和感知。 <img src="https://cdn.noedgeai.com/01914684-a7e0-726c-b2f6-d06ecf361578_18.jpg?x=808&y=1111&w=176&h=229"/> Yuefan Hao received a B.E. degree in Communication Engineering from Kunming University of Science and Technology, Kunming, China, in 2017 and a M.E. degree in Electrical and Communication Engineering from University of Electronic Science and Technology of China, Chengdu, China, in 2020. From June 2020 to June 2024, he served as a robot algorithm engineer at Geekplus Technology Co., Ltd., Beijing, China. His research interests include computer vision, deep learning, and robotics. 郝越凡于2017年在中国昆明的昆明理工大学获得通信工程学士学位,并于2020年在成都的电子科技大学获得电气与通信工程硕士学位。自2020年6月至2024年6月,他在中国北京的高科智加科技有限公司担任机器人算法工程师。他的研究兴趣包括计算机视觉、深度学习和机器人技术。 <img src="https://cdn.noedgeai.com/01914684-a7e0-726c-b2f6-d06ecf361578_18.jpg?x=795&y=1698&w=204&h=207"/> Shenghai Yuan received his B.S. and Ph.D. degrees in Electrical and Electronic Engineering in 2013 and 2019, respectively, from Nanyang Technological University, Singapore. His research focuses on robotics perception and navigation. He is a postdoctoral senior research fellow at the Centre for Advanced Robotics Technology Innovation (CARTIN) at Nanyang Technological University, Singapore. He has contributed over 70 papers to journals such as TRO, IJRR, TIE, and RAL, and to conferences including ICRA, CVPR, ICCV, NeurIPS, and IROS. He has also contributed over 10 technical disclosures and patents. Currently, he serves as an associate editor for the Unmanned Systems Journal and as a guest editor for the Electronics Special Issue on Advanced Technologies of Navigation for Intelligent Vehicles. 
He achieved second place in the academic track of the 2021 Hilti SLAM Challenge, third place in the visual-inertial track of the 2023 ICCV SLAM Challenge, and won the IROS 2023 Best Entertainment and Amusement Paper Award. He also received the Outstanding Reviewer Award at ICRA 2024. He served as the organizer of the CARIC UAV Swarm Challenge and Workshop at the 2023 CDC and the UG2 Anti-drone Challenge and Workshop at CVPR 2024. Currently, he is the organizer of the second CARIC UAV Swarm Challenge and Workshop at IROS 2024. 袁慎海于2013年和2019年分别在新加坡南洋理工大学获得电气与电子工程学士和博士学位。他的研究重点是机器人感知和导航。他是新加坡南洋理工大学先进机器人技术创新中心(CARTIN)的高级博士后研究员。他在TRO、IJRR、TIE和RAL等期刊以及ICRA、CVPR、ICCV、NeurIPS和IROS等会议上发表了超过70篇论文。他还贡献了超过10项技术披露和专利。目前,他担任《无人系统》期刊的副主编和《电子学报》关于智能车辆导航先进技术的特刊客座编辑。他在2021年喜利得SLAM挑战赛学术赛道中获得第二名,在2023年ICCV SLAM挑战赛视觉-惯性赛道中获得第三名,并获得了IROS 2023最佳娱乐与娱乐论文奖。他还获得了ICRA 2024杰出审稿人奖。他曾担任2023年CDC CARIC无人机集群挑战赛和研讨会的组织者,以及CVPR 2024 UG2反无人机挑战赛和研讨会的组织者。目前,他是IROS 2024第二届CARIC无人机集群挑战赛和研讨会的组织者。 <img src="https://cdn.noedgeai.com/01914684-a7e0-726c-b2f6-d06ecf361578_19.jpg?x=121&y=147&w=195&h=245"/> Chen Wang received a B.Eng. degree in Electrical Engineering from Beijing Institute of Technology (BIT) in 2014 and a Ph.D. degree in Electrical Engineering from Nanyang Technological University (NTU) Singapore in 2019. He was a Postdoctoral Fellow with the Robotics Institute at Carnegie Mellon University (CMU) from 2019 to 2022. 王晨于2014年在北京理工大学获得电气工程学士学位,并于2019年在新加坡南洋理工大学获得电气工程博士学位。2019年至2022年,他在卡内基梅隆大学机器人研究所担任博士后研究员。 <img src="https://cdn.noedgeai.com/01914684-a7e0-726c-b2f6-d06ecf361578_19.jpg?x=120&y=749&w=196&h=244"/> Dr. Wang is an Assistant Professor and leading the Spatial AI & Robotics (SAIR) Lab at the Department of Computer Science and Engineering, University at Buffalo (UB). He is an Associate Editor for the International Journal of Robotics Research (IJRR) and IEEE Robotics and Automation Letters (RA-L) and an Associate Co-chair for the IEEE Technical Committee for Computer & Robot Vision. He served as an Area Chair for the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023, 2024, and The Conference on Neural Information Processing Systems (NeurIPS) 2024. His research interests include Spatial AI and Robotics. 王博士是布法罗大学(UB)计算机科学与工程系的助理教授,并领导空间人工智能与机器人(SAIR)实验室。他是国际机器人研究杂志(IJRR)和IEEE机器人与自动化快报(RA-L)的副主编,也是IEEE计算机与机器人视觉技术委员会的联合主席。他曾担任2023年和2024年IEEE/CVF计算机视觉与模式识别会议(CVPR)和2024年神经信息处理系统会议(NeurIPS)的领域主席。他的研究兴趣包括空间人工智能和机器人技术。 <img src="https://cdn.noedgeai.com/01914684-a7e0-726c-b2f6-d06ecf361578_19.jpg?x=125&y=1242&w=191&h=223"/> Lihua Xie (Fellow, IEEE) received the Ph.D. degree in electrical engineering from the University of Newcastle, Australia, in 1992. Since 1992, he has been with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, where he is currently a President's Chair and Director, Center for Advanced Robotics Technology Innovation. He served as the Head of Division of Control and Instrumentation and Co-Director, Delta-NTU Corporate Lab for Cyber-Physical Systems. He held teaching appointments in the Department of Automatic Control, Nanjing University of Science and Technology from 1986 to 1989. Dr Xie's research interests include robust control and estimation, networked control systems, multi-agent networks, and unmanned systems. 
He is an Editor-in-Chief for Unmanned Systems and has served as Editor of IET Book Series in Control and Associate Editor of a number of journals including IEEE Transactions on Automatic Control, Automatica, IEEE Transactions on Control Systems Technology, IEEE Transactions on Network Control Systems, and IEEE Transactions on Circuits and Systems-II. He was an IEEE Distinguished Lecturer (Jan 2012 - Dec 2014). Dr Xie is Fellow of Academy of Engineering Singapore, IEEE, IFAC, and CAA. 谢丽华(IEEE会士)于1992年获得澳大利亚纽卡斯尔大学电气工程博士学位。自1992年以来,他一直在新加坡南洋理工大学电气与电子工程学院工作,目前担任校长讲席教授和先进机器人技术创新中心主任。他曾担任控制与仪器系主任和Delta-NTU网络物理系统企业实验室的联合主任。1986年至1989年,他在南京理工大学自动控制系任教。谢博士的研究兴趣包括鲁棒控制与估计、网络控制系统、多智能体网络和无人系统。他是《无人系统》的主编,并曾担任IET控制系列书籍的编辑和多个期刊的副主编,包括IEEE自动控制汇刊、自动化学报、IEEE控制系统技术汇刊、IEEE网络控制系统汇刊和IEEE电路与系统-II汇刊。他曾是IEEE杰出讲师(2012年1月至2014年12月)。谢博士是新加坡工程院、IEEE、IFAC和中国自动化学会的会士。