Visual place recognition is essential for vision-based robot localization and SLAM. Despite the tremendous progress made in recent years, place recognition in changing environments remains challenging. A promising approach to cope with appearance variations is to leverage high-level semantic features such as objects or place categories. In this paper, we propose FM-Loc, a novel image-based localization approach built on Foundation Models: it uses the Large Language Model GPT-3 in combination with the Visual-Language Model CLIP to construct a semantic image descriptor that is robust to severe changes in scene geometry and camera viewpoint. We deploy CLIP to detect objects in an image, GPT-3 to suggest potential room labels based on the detected objects, and CLIP again to propose the most likely location label. The object labels and the scene label constitute an image descriptor that we use to calculate a similarity score between query and database images. We validate our approach on real-world data that exhibit significant changes in camera viewpoint and object placement between the database and query trajectories. The experimental results demonstrate that our method is applicable to a wide range of indoor scenarios without the need for training or fine-tuning.
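To make the described pipeline concrete, the following is a minimal sketch of the CLIP-GPT-3-CLIP descriptor flow the abstract outlines. It is not the authors' implementation: the object and room vocabularies, the prompt wording, and the set-overlap similarity score are illustrative assumptions, and it relies on the openai/CLIP package and the legacy (pre-1.0) OpenAI completions API that GPT-3 models were served through.

```python
# Hedged sketch of an FM-Loc-style descriptor pipeline (assumptions noted inline).
import clip          # openai/CLIP package: pip install git+https://github.com/openai/CLIP.git
import torch
import openai        # legacy openai<1.0 API assumed for GPT-3 completions
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Assumed candidate vocabularies; the paper's actual label sets are not given in the abstract.
OBJECTS = ["chair", "desk", "monitor", "sofa", "refrigerator", "whiteboard", "bookshelf"]

def clip_rank(image, labels, template="a photo of a {}", top_k=5):
    """Zero-shot CLIP: rank text labels by cosine similarity to the image."""
    image_input = preprocess(image).unsqueeze(0).to(device)
    text_input = clip.tokenize([template.format(l) for l in labels]).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image_input)
        text_feat = model.encode_text(text_input)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)
    idx = sims.topk(min(top_k, len(labels))).indices.tolist()
    return [labels[i] for i in idx]

def gpt3_room_candidates(objects):
    """GPT-3 suggests room labels for the detected objects (prompt is an assumption)."""
    prompt = (f"A robot sees the following objects: {', '.join(objects)}. "
              f"Name the three most likely room types, comma-separated.")
    resp = openai.Completion.create(model="text-davinci-003",
                                    prompt=prompt, max_tokens=32, temperature=0)
    return [r.strip() for r in resp.choices[0].text.split(",") if r.strip()]

def describe(image):
    """Build the semantic descriptor: object labels plus a scene label."""
    objects = clip_rank(image, OBJECTS, top_k=5)        # step 1: CLIP detects objects
    rooms = gpt3_room_candidates(objects)               # step 2: GPT-3 proposes rooms
    scene = clip_rank(image, rooms, top_k=1)[0]         # step 3: CLIP picks the room label
    return set(objects), scene

def similarity(desc_q, desc_db):
    """Placeholder score: object-set overlap plus a bonus for a matching scene label.
    The paper's actual scoring function is not specified in the abstract."""
    objs_q, scene_q = desc_q
    objs_db, scene_db = desc_db
    overlap = len(objs_q & objs_db) / max(1, len(objs_q | objs_db))
    return overlap + (1.0 if scene_q == scene_db else 0.0)
```

A query image would be localized by computing `describe()` once per database image offline, then ranking database entries by `similarity()` against the query descriptor; because both models are used zero-shot, no training or fine-tuning is involved, matching the claim in the abstract.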