Visual localization, i.e., camera pose estimation in a known scene, is a core component of technologies such as autonomous driving and augmented reality. State-of-the-art localization approaches often rely on image retrieval techniques for one of two tasks: (1) to provide an approximate pose estimate or (2) to determine which parts of the scene are potentially visible in a given query image. It is common practice to use state-of-the-art image retrieval algorithms for these tasks. These algorithms are often trained to retrieve the same landmark under a large range of viewpoint changes. However, robustness to viewpoint changes is not necessarily desirable in the context of visual localization. This paper focuses on understanding the role of image retrieval for multiple visual localization tasks. We introduce a benchmark setup and compare state-of-the-art retrieval representations on multiple datasets. We show that retrieval performance on classical landmark retrieval/recognition tasks correlates with localization performance for only some of these tasks, not all of them. This indicates a need for retrieval approaches specifically designed for localization tasks. Our benchmark and evaluation protocols are available at https://github.com/naver/kapture-localization.
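To make task (1) concrete, the following is a minimal sketch of pose approximation via retrieval, not the benchmark's actual implementation. It assumes each image is already summarized by a global descriptor vector and that database images have known camera positions; the function names (`cosine_similarity`, `retrieve_top_k`, `approximate_position`) and the similarity-weighted averaging are illustrative assumptions, and camera orientation is ignored for brevity.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two descriptor vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def retrieve_top_k(query_desc, db_descs, k=3):
    """Return (index, similarity) pairs for the k most similar database images."""
    scored = [(cosine_similarity(query_desc, d), i) for i, d in enumerate(db_descs)]
    scored.sort(key=lambda s: -s[0])  # highest similarity first
    return [(i, s) for s, i in scored[:k]]


def approximate_position(query_desc, db_descs, db_positions, k=3):
    """Estimate the query position as a similarity-weighted mean of the
    positions of the top-k retrieved images (hypothetical weighting scheme)."""
    top = retrieve_top_k(query_desc, db_descs, k)
    weights = [max(s, 0.0) for _, s in top]  # ignore negative similarities
    total = sum(weights)
    if total == 0.0:
        # Fall back to a plain average if no retrieved image is similar.
        weights = [1.0] * len(top)
        total = float(len(top))
    dims = len(db_positions[0])
    return [
        sum(w * db_positions[i][d] for (i, _), w in zip(top, weights)) / total
        for d in range(dims)
    ]


if __name__ == "__main__":
    # Toy data: three database images with 2-D descriptors and 3-D positions.
    db_descs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    db_positions = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [5.0, 5.0, 0.0]]
    print(approximate_position([1.0, 0.0], db_descs, db_positions, k=2))
```

With `k=1` the estimate is simply the position of the most similar database image. A larger `k` interpolates between neighbours, which only helps if the retrieved images are actually close to the query viewpoint. That is why a representation trained for strong viewpoint invariance is not necessarily a good fit for this task.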