Investigating the Role of Image Retrieval for Visual Localization -- An exhaustive benchmark

Visual localization, i.e., camera pose estimation in a known scene, is a core component of technologies such as autonomous driving and augmented reality. State-of-the-art localization approaches often rely on image retrieval techniques for one of two purposes: (1) to provide an approximate pose estimate or (2) to determine which parts of the scene are potentially visible in a given query image. It is common practice to use state-of-the-art image retrieval algorithms for both purposes. These algorithms are often trained to retrieve the same landmark under a large range of viewpoint changes, a goal that often differs from the requirements of visual localization. To investigate the consequences for visual localization, this paper focuses on understanding the role of image retrieval in multiple visual localization paradigms. First, we introduce a novel benchmark setup and compare state-of-the-art retrieval representations on multiple datasets, using localization performance as the metric. Second, we investigate several definitions of "ground truth" for image retrieval. Using these definitions as upper bounds for the visual localization paradigms, we show that there is still significant room for improvement. Third, using these tools and in-depth analysis, we show that retrieval performance on classical landmark retrieval or place recognition tasks correlates with localization performance for only some of the paradigms, not all. Finally, we analyze the effects of blur and dynamic scenes in the images. We conclude that there is a need for retrieval approaches specifically designed for localization paradigms. Our benchmark and evaluation protocols are available at https://github.com/naver/kapture-localization.
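
As a rough illustration of purpose (1) and of scoring retrieval by localization performance, the following minimal sketch ranks database images by cosine similarity of global descriptors, takes the position of the top-ranked image as the approximate estimate, and reports the fraction of queries within a position-error threshold. It is not the benchmark's implementation: the function names, the top-1 approximation, and the position-only threshold are assumptions made for this example.

```python
import numpy as np


def top_k_retrieval(query_desc, db_desc, k):
    """Return, per query, the indices of the k most similar database images.

    Similarity is the cosine similarity between global descriptors
    (an assumption of this sketch).
    """
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    d = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    similarity = q @ d.T                      # shape: (num_queries, num_db)
    return np.argsort(-similarity, axis=1)[:, :k]


def approximate_positions(ranked, db_positions):
    """Use the position of the top-1 retrieved image as the estimate."""
    return db_positions[ranked[:, 0]]


def localized_fraction(est_positions, gt_positions, max_error_m):
    """Fraction of queries whose position error is within max_error_m.

    Hypothetical, position-only metric; orientation error is ignored here.
    """
    errors = np.linalg.norm(est_positions - gt_positions, axis=1)
    return float(np.mean(errors <= max_error_m))
```

With such a setup, two retrieval representations can be compared by the localized fraction they yield on the same queries rather than by a landmark-retrieval score.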
