Semantic 3D scene understanding is a problem of critical importance in robotics. While significant advances have been made in spatial perception, robots are still far from having the common-sense knowledge an average human has about household objects and their locations. We thus investigate the use of large language models to impart common sense for scene understanding. Specifically, we introduce three paradigms for leveraging language to classify rooms in indoor environments based on the objects they contain: (i) a zero-shot approach, (ii) a feed-forward classifier approach, and (iii) a contrastive classifier approach. These methods operate on 3D scene graphs produced by modern spatial perception systems. We then analyze each approach, demonstrating notable zero-shot generalization and transfer capabilities stemming from their use of language. Finally, we show that these approaches also apply to inferring building labels from contained rooms, and we demonstrate our zero-shot approach in a real environment. All code can be found at https://github.com/MIT-SPARK/llm_scene_understanding.
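To make the zero-shot paradigm concrete, the sketch below scores candidate room labels by how plausible a pretrained language model finds a template sentence linking a room's contained objects to each label. This is only an illustrative assumption of how such a query could be posed, not the paper's exact prompt, model, or label set; see the linked repository for the authors' implementation.

```python
# Minimal sketch of zero-shot room classification from contained objects.
# Assumptions: GPT-2 as the scoring language model, a hand-written prompt
# template, and a small hypothetical label set.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()


def sentence_log_likelihood(sentence: str) -> float:
    """Average per-token log-likelihood of a sentence under the language model."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    # outputs.loss is the mean negative log-likelihood per token, so negate it.
    return -outputs.loss.item()


def classify_room(objects: list[str], labels: list[str]) -> str:
    """Return the room label whose template sentence the model scores highest."""
    object_str = ", ".join(objects)
    scores = {
        label: sentence_log_likelihood(
            f"A room containing {object_str} is called a {label}."
        )
        for label in labels
    }
    return max(scores, key=scores.get)


# Example usage with hypothetical inputs.
print(classify_room(["a bed", "a nightstand", "a lamp"],
                    ["bedroom", "kitchen", "bathroom", "living room"]))
```

The same scoring idea extends naturally to the building-level task mentioned above by swapping object names for room labels in the template.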