Robust robot localization is an important prerequisite for navigation planning. If the environment map was created with different sensors, robot measurements must be robustly associated with map features. In this work, we extend Monte Carlo Localization with vision-language features. These open-vocabulary features allow us to robustly compute the likelihood of visual observations, given a camera pose and a 3D map created from posed RGB-D images or aligned point clouds. The abstract vision-language features further allow observations and map elements from different modalities to be associated with one another. Global localization can be initialized by natural language descriptions of the objects in the vicinity of a location. We evaluate our approach on Matterport3D and Replica for indoor scenes and demonstrate generalization on SemanticKITTI for outdoor scenes.
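To make the observation model concrete, the following is a minimal sketch of how such a vision-language likelihood could be computed for a single particle. It assumes a point-cloud map annotated with unit-normalized vision-language embeddings and per-pixel features extracted from the current camera image; all function and parameter names, as well as the exponential similarity-to-likelihood model, are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def observation_likelihood(obs_features, map_points, map_features,
                           pose, K, image_size, sigma=0.1):
    """Hypothetical sketch: score one particle pose by comparing
    observed vision-language features against map features that
    project into the hypothesized camera view.

    obs_features : (H, W, D) unit-normalized per-pixel image features
    map_points   : (N, 3) map point positions in the world frame
    map_features : (N, D) unit-normalized per-point map features
    pose         : (4, 4) world-to-camera transform of the particle
    K            : (3, 3) camera intrinsics
    """
    H, W = image_size
    # Transform map points into the hypothesized camera frame.
    pts_h = np.hstack([map_points, np.ones((len(map_points), 1))])
    pts_cam = (pose @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0.1  # keep points in front of the camera
    # Project the remaining points onto the image plane.
    uv = (K @ pts_cam[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    visible = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    if not np.any(visible):
        return 1e-12  # no map support in view: near-zero likelihood
    # Cosine similarity between observed and map features
    # (a dot product suffices, since both are unit-normalized).
    sims = np.einsum('nd,nd->n',
                     obs_features[v[visible], u[visible]],
                     map_features[in_front][visible])
    # Map the mean similarity to a likelihood, here via a simple
    # exponential model; sigma controls how peaked the model is.
    return float(np.exp((sims.mean() - 1.0) / sigma))
```

In a Monte Carlo Localization loop, this likelihood would serve as the importance weight of each particle, with weights normalized over the particle set before resampling; because the features live in a shared open-vocabulary embedding space, the same comparison works whether the map features were extracted from RGB-D images or from aligned point clouds.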