In this paper, we address the challenging problem of 3D concept grounding (i.e., segmenting and learning visual concepts) by looking at RGBD images and reasoning about paired questions and answers. Existing visual reasoning approaches typically use supervised methods to extract 2D segmentation masks on which concepts are grounded. In contrast, humans are capable of grounding concepts on the underlying 3D representation of images. However, traditionally inferred 3D representations (e.g., point clouds, voxel grids, and meshes) cannot capture continuous 3D features flexibly, which makes it challenging to ground concepts to 3D regions based on the language description of the object being referred to. To address both issues, we propose leveraging the continuous, differentiable nature of neural fields to segment and learn concepts. Specifically, each 3D coordinate in a scene is represented as a high-dimensional descriptor. Concept grounding can then be performed by computing the similarity between the descriptor vector of a 3D coordinate and the vector embedding of a language concept, which enables segmentation and concept learning to be learned jointly on neural fields in a differentiable fashion. As a result, both 3D semantic and instance segmentations can emerge directly from question answering supervision, using a set of defined neural operators on top of neural fields (e.g., filtering and counting). Experimental results show that our proposed framework outperforms unsupervised and language-mediated segmentation models on semantic and instance segmentation tasks, and also outperforms existing models on challenging 3D-aware visual reasoning tasks. Furthermore, our framework generalizes well to unseen shape categories and real scans.
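The grounding step described above (similarity between a per-coordinate descriptor and a concept embedding, followed by operators such as filtering and counting) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the framework's actual implementation: the function names `filter_mask` and `soft_count`, the use of cosine similarity with a sigmoid and a `temperature` parameter, the max-pooled counting rule, and the toy descriptors. NumPy is used for readability only; a trainable version would need an automatic differentiation library so that gradients from question answering can reach the descriptors.

```python
import numpy as np


def cosine_similarity(descriptors, concept):
    """Cosine similarity between each point descriptor (N, D) and one concept vector (D,)."""
    d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    c = concept / np.linalg.norm(concept)
    return d @ c


def filter_mask(descriptors, concept, temperature=0.1):
    """Hypothetical 'filter' operator: soft per-point mask in (0, 1) for a concept.

    Assumption: a sigmoid over temperature-scaled cosine similarity.
    """
    sim = cosine_similarity(descriptors, concept)
    return 1.0 / (1.0 + np.exp(-sim / temperature))


def soft_count(mask, instance_weights):
    """Hypothetical 'count' operator.

    instance_weights has shape (N, K): a soft assignment of N points to K
    instance slots. Each slot contributes its strongest masked point, and
    the slot scores are summed into a soft count.
    """
    per_instance = (mask[:, None] * instance_weights).max(axis=0)
    return per_instance.sum()


# Toy scene (made-up values): 4 sampled 3D coordinates with 3-dim descriptors.
# Points 0 and 1 align with the concept; points 2 and 3 point the opposite way.
descriptors = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0],
])
concept = np.array([1.0, 0.0, 0.0])  # stand-in for a learned concept embedding

# One-hot assignment of points to two instance slots.
instance_weights = np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
])

mask = filter_mask(descriptors, concept)
count = soft_count(mask, instance_weights)
```

In this toy scene the mask is close to 1 for the two aligned points and close to 0 for the others, so the soft count is approximately one matching instance. Because every step is a smooth function of the descriptors and the concept vector, the same computation written in an autodiff framework could be trained end to end from answer supervision, which is the property the abstract attributes to neural fields.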