We propose a novel learning-based formulation for visual localization of vehicles that can operate in real time in city-scale environments. Visual localization algorithms determine the position and orientation from which an image has been captured, using a set of geo-referenced images or a 3D scene representation. Our new localization paradigm, named Implicit Pose Encoding (ImPosing), embeds images and camera poses into a common latent representation with two separate neural networks, such that we can compute a similarity score for each image-pose pair. By evaluating candidate poses in this latent space in a hierarchical manner, the camera position and orientation are not directly regressed but incrementally refined. Very large environments force competing methods to store gigabytes of map data, whereas our representation remains compact regardless of the reference database size. In this paper, we describe how to effectively optimize our learned modules, how to combine them to achieve real-time localization, and demonstrate results on diverse large-scale scenarios that significantly outperform prior work in accuracy and computational efficiency.
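The sketch below is a minimal, illustrative rendering of the idea described above, not the authors' implementation: an image encoder and a pose encoder map their inputs into a shared latent space, a dot product yields an image-pose similarity score, and candidate poses are evaluated coarse-to-fine rather than regressed directly. The latent dimension, network sizes, pose parameterization, and the grid/level parameters of the hierarchical search are all assumptions made for the example.

```python
import torch
import torch.nn as nn

LATENT_DIM = 256  # assumed latent size for the shared embedding space


class PoseEncoder(nn.Module):
    """Embeds a camera pose (x, y, cos(yaw), sin(yaw)) into the shared latent space."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, 256), nn.ReLU(),
            nn.Linear(256, LATENT_DIM),
        )

    def forward(self, poses):            # poses: (N, 4)
        return self.mlp(poses)           # (N, LATENT_DIM)


class ImageEncoder(nn.Module):
    """Embeds an RGB image into the same latent space (tiny CNN as a stand-in)."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, LATENT_DIM),
        )

    def forward(self, images):           # images: (B, 3, H, W)
        return self.cnn(images)          # (B, LATENT_DIM)


def similarity(img_emb, pose_emb):
    """Similarity score for every image-pose pair (dot product in latent space)."""
    return img_emb @ pose_emb.t()        # (B, N)


@torch.no_grad()
def hierarchical_localize(image, image_enc, pose_enc, area, levels=4, grid=8):
    """Coarse-to-fine search: score a grid of candidate poses, keep the best one,
    then repeat on a shrinking neighborhood instead of regressing the pose."""
    img_emb = image_enc(image.unsqueeze(0))                # (1, LATENT_DIM)
    (x_min, x_max), (y_min, y_max) = area
    best = None
    for _ in range(levels):
        xs = torch.linspace(x_min, x_max, grid)
        ys = torch.linspace(y_min, y_max, grid)
        yaws = torch.linspace(-torch.pi, torch.pi, grid)
        cand = torch.cartesian_prod(xs, ys, yaws)          # (grid^3, 3)
        poses = torch.stack([cand[:, 0], cand[:, 1],
                             cand[:, 2].cos(), cand[:, 2].sin()], dim=1)
        scores = similarity(img_emb, pose_enc(poses))      # (1, grid^3)
        best = cand[scores.argmax()]
        # Shrink the translation search window around the best candidate.
        dx, dy = (x_max - x_min) / grid, (y_max - y_min) / grid
        x_min, x_max = best[0] - dx, best[0] + dx
        y_min, y_max = best[1] - dy, best[1] + dy
    return best                                            # (x, y, yaw)
```

Because the map is carried implicitly by the weights of the pose encoder, the memory footprint of such a scheme stays constant as the covered area grows, which is the property the abstract contrasts with methods that must store a reference database.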