Self-driving vehicles rely on urban street maps for autonomous navigation. In this paper, we introduce Pix2Map, a method for inferring urban street map topology directly from ego-view images, as needed to continually update and expand existing maps. This is a challenging task, as we need to infer complex urban road topology directly from raw image data. The main insight of this paper is that this problem can be posed as cross-modal retrieval, by learning a joint embedding space for images and existing maps, where maps are represented as discrete graphs that encode the topological layout of the visual surroundings. We conduct our experimental evaluation on the Argoverse dataset and show that it is indeed possible to accurately retrieve street maps corresponding to both seen and unseen roads solely from image data. Moreover, we show that our retrieved maps can be used to update or expand existing maps, and we even show proof-of-concept results for visual localization and image retrieval from spatial graphs.
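To make the retrieval formulation concrete, the joint embedding space can be sketched as a CLIP-style symmetric contrastive objective between an image encoder and a graph encoder. The code below is a minimal illustration under assumed names (the `image_encoder` and `graph_encoder` modules, the learnable temperature, and the class name `Pix2MapSketch` are all assumptions for exposition), not the paper's actual architecture or training recipe.

```python
# Minimal sketch: joint image/graph embedding trained with a symmetric
# contrastive loss. Encoders and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pix2MapSketch(nn.Module):
    def __init__(self, image_encoder: nn.Module, graph_encoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder  # e.g. a CNN/ViT over ego-view images
        self.graph_encoder = graph_encoder  # e.g. a GNN over street-map graphs
        self.logit_scale = nn.Parameter(torch.tensor(1.0))  # learnable temperature

    def forward(self, images, graphs):
        # Embed each modality and L2-normalize, so similarity is cosine.
        img = F.normalize(self.image_encoder(images), dim=-1)  # (B, D)
        gph = F.normalize(self.graph_encoder(graphs), dim=-1)  # (B, D)
        # All-pairs similarities; matched image/graph pairs sit on the diagonal.
        logits = self.logit_scale.exp() * img @ gph.t()  # (B, B)
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric loss: retrieve graphs from images and images from graphs.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
```

Under this formulation, test-time inference reduces to nearest-neighbor search in the shared space: embed the query ego-view image and return the graph with the highest cosine similarity from a library of candidate street-map graphs.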