Self-driving vehicles rely on urban street maps for autonomous navigation. In this paper, we introduce Pix2Map, a method for inferring urban street map topology directly from ego-view images, as needed to continually update and expand existing maps. This is a challenging task, as we need to infer a complex urban road topology directly from raw image data. The main insight of this paper is that this problem can be posed as cross-modal retrieval by learning a joint, cross-modal embedding space for images and existing maps, represented as discrete graphs that encode the topological layout of the visual surroundings. We conduct our experimental evaluation using the Argoverse dataset and show that it is indeed possible to accurately retrieve street maps corresponding to both seen and unseen roads solely from image data. Moreover, we show that our retrieved maps can be used to update or expand existing maps and even show proof-of-concept results for visual localization and image retrieval from spatial graphs.
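To make the cross-modal retrieval formulation concrete, the sketch below shows one plausible way to learn a joint embedding space for ego-view images and street-map graphs with a CLIP-style symmetric contrastive objective, then retrieve the best-matching graph for a query image by nearest neighbor in that space. This is a minimal illustration under assumptions, not the paper's actual implementation: the encoder architectures, the names `ImageEncoder`, `GraphEncoder`, `contrastive_loss`, and `retrieve`, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Toy stand-in for the ego-view image encoder (architecture assumed)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))

    def forward(self, images):                  # (B, C, H, W)
        return F.normalize(self.net(images), dim=-1)

class GraphEncoder(nn.Module):
    """Toy stand-in for the street-map graph encoder: pools node features.
    A real encoder would also exploit edge/topology information."""
    def __init__(self, node_dim=4, dim=256):
        super().__init__()
        self.proj = nn.Linear(node_dim, dim)

    def forward(self, node_feats):              # (B, N, node_dim)
        return F.normalize(self.proj(node_feats).mean(dim=1), dim=-1)

def contrastive_loss(img_emb, graph_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image/graph pairs are pulled together,
    mismatched pairs in the batch are pushed apart."""
    logits = img_emb @ graph_emb.t() / temperature
    targets = torch.arange(len(logits))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def retrieve(img_emb, graph_library):
    """Return the index of the nearest map graph for one query image.
    img_emb: (d,); graph_library: (K, d) precomputed graph embeddings."""
    return torch.argmax(graph_library @ img_emb).item()

# Tiny smoke test on random data.
images = torch.randn(8, 3, 32, 32)
graphs = torch.randn(8, 12, 4)                  # 12 nodes, 4 features each
img_e, gr_e = ImageEncoder()(images), GraphEncoder()(graphs)
loss = contrastive_loss(img_e, gr_e)
best = retrieve(img_e[0], gr_e)
```

One appeal of this retrieval framing, as the abstract suggests, is that it sidesteps generating discrete graph structure from pixels: the model only needs to rank existing candidate graphs against an image embedding, which also makes updating the map library independent of retraining the encoders.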