This work addresses visual cross-view metric localization for outdoor robotics. Given a ground-level color image and a satellite patch that contains the local surroundings, the task is to identify the location of the ground camera within the satellite patch. Related work has addressed this task for range sensors (LiDAR, radar), but for vision only as a secondary regression step after an initial cross-view image retrieval step. Since the local satellite patch could also be retrieved through any rough localization prior (e.g., from GPS/GNSS or temporal filtering), we drop the image retrieval objective and focus on metric localization only. We devise a novel network architecture with denser satellite descriptors, similarity matching at the bottleneck (rather than at the output as in image retrieval), and a dense spatial distribution as output to capture multi-modal localization ambiguities. We compare against a state-of-the-art regression baseline that uses global image descriptors. Quantitative and qualitative experimental results on the recently proposed VIGOR and the Oxford RobotCar datasets validate our design. The produced probabilities are correlated with localization accuracy, and can even be used to roughly estimate the ground camera's heading when its orientation is unknown. Overall, our method reduces the median metric localization error by 51%, 37%, and 28% compared to the state of the art when generalizing in the same area, across areas, and across time, respectively.
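To make the architectural idea concrete, below is a minimal, illustrative sketch (not the authors' exact architecture) of matching at the bottleneck: a global ground-image descriptor is correlated against a dense satellite feature map, and the similarity scores are normalized into a spatial probability distribution over the satellite patch. The encoder choices, feature dimension, and module names (`CrossViewMatcher`, `feat_dim`) are hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossViewMatcher(nn.Module):
    """Sketch of cross-view metric localization via bottleneck matching."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Placeholder CNN encoders; the paper's actual backbones may differ.
        self.ground_encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global descriptor for the ground view
        )
        self.sat_encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),  # dense per-location satellite descriptors
        )

    def forward(self, ground: torch.Tensor, sat: torch.Tensor) -> torch.Tensor:
        # ground: (B, 3, Hg, Wg), sat: (B, 3, Hs, Ws)
        g = self.ground_encoder(ground).flatten(1)   # (B, C)
        s = self.sat_encoder(sat)                    # (B, C, H, W)
        # Similarity matching at the bottleneck: inner product of the ground
        # descriptor with every spatial location of the satellite feature map.
        sim = torch.einsum("bc,bchw->bhw", g, s)     # (B, H, W)
        # Dense spatial distribution over candidate camera locations; the
        # softmax keeps multi-modal ambiguities instead of a single point.
        B, H, W = sim.shape
        return F.softmax(sim.view(B, -1), dim=1).view(B, H, W)


if __name__ == "__main__":
    model = CrossViewMatcher()
    ground = torch.randn(2, 3, 128, 512)  # e.g. a panoramic ground image
    sat = torch.randn(2, 3, 256, 256)     # satellite patch of the surroundings
    prob = model(ground, sat)             # (2, 128, 128) location probabilities
    print(prob.shape, prob.sum(dim=(1, 2)))  # each map sums to ~1
```

A regression baseline would instead compress both views into global descriptors and predict a single offset; keeping the satellite features dense and matching at the bottleneck is what lets the output express several plausible locations at once.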