In this paper, we propose a robust end-to-end multi-modal pipeline for place recognition in which the sensor setup may differ between map building and querying. Our approach operates directly on images and LiDAR scans without requiring any local feature extraction modules. By projecting the sensor data onto the unit sphere, we learn a multi-modal descriptor of partially overlapping scenes using a spherical convolutional neural network. The spherical projection model readily supports arbitrary LiDAR and camera systems without losing information. Loop closure candidates are found using a nearest-neighbor lookup in the embedding space. To identify the closest place among them, we correlate the candidates' power spectra, which yields a confidence value per candidate; our estimate of the correct place is the candidate with the highest confidence. We evaluate our proposal against state-of-the-art place recognition approaches on real-world data acquired with different sensors. When the sensor setup differs between model training and deployment, our approach achieves a recall that is up to 10% and 5% higher than that of a LiDAR-based and a vision-based system, respectively. Additionally, our place selection correctly identifies up to 95% of the matches in the candidate set.
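To make two of the steps above concrete, the following is a minimal sketch of projecting a LiDAR scan onto the unit sphere and of a nearest-neighbor lookup in an embedding space. It is not the authors' implementation: the function names `project_to_sphere` and `nearest_candidates`, the equiangular grid, the closest-return-per-cell rule, and the brute-force Euclidean search are all assumptions made for illustration. The learned descriptor (the spherical CNN) and the power-spectrum correlation are not covered.

```python
import numpy as np


def project_to_sphere(points, n_lat=4, n_lon=8):
    """Bin 3D points of shape (N, 3) onto an equiangular (n_lat, n_lon) range grid.

    Hypothetical helper: the grid layout and cell rule are assumptions.
    """
    points = np.asarray(points, dtype=float)
    r = np.linalg.norm(points, axis=1)
    keep = r > 0  # drop points at the sensor origin, which have no direction
    points, r = points[keep], r[keep]

    # Polar angle in [0, pi] and azimuth in [0, 2*pi).
    theta = np.arccos(np.clip(points[:, 2] / r, -1.0, 1.0))
    phi = np.mod(np.arctan2(points[:, 1], points[:, 0]), 2 * np.pi)

    # Map angles to grid indices, clamping the upper boundary into the last cell.
    row = np.minimum((theta / np.pi * n_lat).astype(int), n_lat - 1)
    col = np.minimum((phi / (2 * np.pi) * n_lon).astype(int), n_lon - 1)

    # Keep the closest return per cell; cells without a return are set to 0.
    grid = np.full((n_lat, n_lon), np.inf)
    np.minimum.at(grid, (row, col), r)
    grid[np.isinf(grid)] = 0.0
    return grid


def nearest_candidates(query, database, k=3):
    """Return indices of the k database descriptors closest to the query (Euclidean)."""
    dists = np.linalg.norm(np.asarray(database) - np.asarray(query), axis=1)
    return np.argsort(dists)[:k]
```

In this sketch, a camera image could be mapped onto the same grid by converting each pixel's viewing ray to the angles `theta` and `phi`, which is one way to read the claim that the spherical representation is independent of the specific sensor. For large maps, the brute-force search would typically be replaced by a spatial index such as a k-d tree.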