Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels or, in recent work, using text prompts. We address both these issues with ConceptFusion, a scene representation that is (i) fundamentally open-set, enabling reasoning beyond a closed set of concepts, and (ii) inherently multimodal, enabling a diverse range of possible queries to the 3D map, from language, to images, to audio, to 3D geometry, all working in concert. ConceptFusion leverages the open-set capabilities of today's foundation models, pre-trained on internet-scale data, to reason about concepts across modalities such as natural language, images, and audio. We demonstrate that pixel-aligned open-set features can be fused into 3D maps via traditional SLAM and multi-view fusion approaches. This enables effective zero-shot spatial reasoning, with no additional training or finetuning, and retains long-tailed concepts better than supervised approaches, outperforming them by a margin of more than 40% on 3D IoU. We extensively evaluate ConceptFusion on a number of real-world datasets, simulated home environments, a real-world tabletop manipulation task, and an autonomous driving platform. We showcase new avenues for blending foundation models with 3D open-set multimodal mapping. For more information, visit our project page at https://concept-fusion.github.io or watch our 5-minute explainer video at https://www.youtube.com/watch?v=rkXgws8fiDs
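To make the idea of fusing pixel-aligned features into a 3D map and then querying it more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the fusion rule, so the confidence-weighted running average, the cosine-similarity query, and the names `fuse_features` and `query_map` are all assumptions made for illustration. The pixel-to-map-point association (`point_ids`) is assumed to come from a SLAM front end, and the query embedding is assumed to come from a foundation model that shares a feature space with the per-pixel features.

```python
import numpy as np


def fuse_features(map_feats, map_weights, point_ids, pixel_feats, pixel_weights):
    """Fold one frame of per-pixel features into per-point map features.

    Assumed fusion rule (hypothetical): a weighted running average.
    map_feats:     (N, D) float array, current feature per map point
    map_weights:   (N,)   float array, accumulated weight per map point
    point_ids:     (M,)   int array, map point index for each pixel
    pixel_feats:   (M, D) float array, feature for each pixel
    pixel_weights: (M,)   float array, weight (confidence) for each pixel
    """
    # Recover the weighted sums stored so far, then add the new observations.
    # np.add.at accumulates correctly when several pixels hit the same point.
    weighted_sum = map_feats * map_weights[:, None]
    np.add.at(weighted_sum, point_ids, pixel_feats * pixel_weights[:, None])

    new_weights = map_weights.copy()
    np.add.at(new_weights, point_ids, pixel_weights)

    # Normalize; points that have never been observed keep a zero feature.
    fused = np.divide(
        weighted_sum,
        new_weights[:, None],
        out=np.zeros_like(weighted_sum),
        where=new_weights[:, None] > 0,
    )
    return fused, new_weights


def query_map(map_feats, query_embedding):
    """Return the cosine similarity between each map point and a query embedding."""
    norms = np.linalg.norm(map_feats, axis=1) * np.linalg.norm(query_embedding)
    return np.divide(
        map_feats @ query_embedding,
        norms,
        out=np.zeros(len(map_feats)),
        where=norms > 0,
    )
```

In this sketch, the same `query_map` call would serve a text, image, or audio query, provided each is first embedded into the shared feature space. The highest-scoring points are then the candidate locations of the queried concept.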