Visual localization is the task of estimating the camera pose in a known scene, an essential problem in robotics and computer vision. However, long-term visual localization remains challenging due to environmental appearance changes caused by lighting and seasonal variation. While techniques exist to address appearance changes using neural networks, these methods typically require ground-truth pose information to generate accurate image correspondences or to act as a supervisory signal during training. In this paper, we present a novel self-supervised feature learning framework for metric visual localization. We use a sequence-based image matching algorithm across different sequences of images (i.e., experiences) to generate image correspondences without ground-truth labels. We can then sample image pairs to train a deep neural network that learns sparse features with associated descriptors and scores without ground-truth pose supervision. The learned features can be used together with a classical pose estimator for visual stereo localization. We validate the learned features by integrating them into an existing Visual Teach & Repeat pipeline to perform closed-loop localization experiments under different lighting conditions over a total of 22.4 km.