Grounding object properties and relations in 3D scenes is a prerequisite for a wide range of artificial intelligence tasks, such as visually grounded dialogue and embodied manipulation. However, the variability of the 3D domain induces two fundamental challenges: 1) the expense of labeling and 2) the complexity of 3D grounded language. Hence, essential desiderata for models are to be data-efficient, to generalize to different data distributions and tasks with unseen semantic forms, and to ground complex language semantics (e.g., view-point anchoring and multi-object reference). To address these challenges, we propose NS3D, a neuro-symbolic framework for 3D grounding. NS3D translates language into programs with hierarchical structures by leveraging large language-to-code models. Different functional modules in the programs are implemented as neural networks. Notably, NS3D extends prior neuro-symbolic visual reasoning methods by introducing functional modules that effectively reason about high-arity relations (i.e., relations among more than two objects), which are key to disambiguating objects in complex 3D scenes. This modular and compositional architecture enables NS3D to achieve state-of-the-art results on the ReferIt3D view-dependence task, a 3D referring expression comprehension benchmark. Importantly, NS3D shows significantly improved performance in data-efficiency and generalization settings, and demonstrates zero-shot transfer to an unseen 3D question-answering task.
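To make the program-based formulation concrete, the sketch below shows the kind of hierarchical program such a framework might execute for a referring expression with a high-arity relation, e.g., "the chair between the table and the window." The module names (`filter_category`, `relate_ternary`), the object representation, and the geometric scoring heuristic are hypothetical stand-ins chosen for illustration; in NS3D these modules are implemented as neural networks rather than the hand-written rules used here.

```python
# Illustrative sketch only: a toy symbolic program for the utterance
# "the chair between the table and the window". Module names and the
# scoring logic are hypothetical, not NS3D's actual implementation.
from dataclasses import dataclass
from typing import List


@dataclass
class Obj:
    category: str
    center: List[float]  # 3D centroid of the object


def filter_category(objs: List[Obj], category: str) -> List[Obj]:
    """Select candidate objects by semantic category."""
    return [o for o in objs if o.category == category]


def relate_ternary(targets: List[Obj], anchors_a: List[Obj], anchors_b: List[Obj]) -> Obj:
    """Score each target by closeness to the midpoint of an anchor pair,
    a crude geometric proxy for the arity-3 relation 'between'."""
    def score(t: Obj, a: Obj, b: Obj) -> float:
        mid = [(x + y) / 2 for x, y in zip(a.center, b.center)]
        return -sum((p - q) ** 2 for p, q in zip(t.center, mid))

    return max(
        targets,
        key=lambda t: max(score(t, a, b) for a in anchors_a for b in anchors_b),
    )


scene = [
    Obj("chair", [1.0, 0.0, 0.0]),
    Obj("chair", [4.0, 3.0, 0.0]),
    Obj("table", [0.0, 0.0, 0.0]),
    Obj("window", [2.0, 0.0, 1.0]),
]

# Hierarchical program for "the chair between the table and the window":
target = relate_ternary(
    filter_category(scene, "chair"),
    filter_category(scene, "table"),
    filter_category(scene, "window"),
)
print(target)  # -> the chair at [1.0, 0.0, 0.0]
```

The point of the sketch is the composition: a language-to-code model would emit the nested call structure, while each module (category filtering, high-arity relation scoring) is a learned component operating on 3D object features rather than raw coordinates.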