Learning descriptive 3D features is crucial for understanding 3D scenes with diverse objects and complex structures. However, it is usually unknown whether important geometric attributes and scene context receive enough emphasis in an end-to-end trained 3D scene understanding network. To guide 3D feature learning toward important geometric attributes and scene context, we explore the use of textual scene descriptions. Given free-form descriptions paired with 3D scenes, we extract knowledge about object relationships and object attributes. We then inject this knowledge into 3D feature learning through three classification-based auxiliary tasks. This language-assisted training can be combined with modern object detection and instance segmentation methods to improve 3D semantic scene understanding, especially in a label-deficient regime. Moreover, the 3D features learned with language assistance are better aligned with language features, which can benefit various 3D-language multimodal tasks. Experiments on several benchmarks of 3D-only and 3D-language tasks demonstrate the effectiveness of our language-assisted 3D feature learning. Code is available at https://github.com/Asterisci/Language-Assisted-3D.
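As a rough illustration of how classification-based auxiliary tasks can be attached to a 3D backbone during training, the sketch below adds weighted cross-entropy terms to a main task loss (for example, a detection or segmentation loss). The task names, the per-task weights, and the plain weighted-sum formulation are hypothetical choices for illustration only; the abstract does not specify how the three auxiliary tasks are defined or balanced, so this is not the method's actual loss.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy for one sample, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def mean_cross_entropy(batch_logits: List[Sequence[float]], targets: List[int]) -> float:
    """Average cross-entropy over a batch of (logits, target) pairs."""
    if len(batch_logits) != len(targets):
        raise ValueError("batch_logits and targets must have the same length")
    if not targets:
        return 0.0
    return sum(cross_entropy(l, t) for l, t in zip(batch_logits, targets)) / len(targets)


def language_assisted_loss(
    main_loss: float,
    aux_outputs: Dict[str, Tuple[List[Sequence[float]], List[int]]],
    weights: Dict[str, float],
) -> float:
    """Main loss plus a weighted sum of auxiliary classification losses.

    aux_outputs maps a (hypothetical) task name to (logits per sample, labels),
    where the labels would be derived from the paired text descriptions.
    Tasks missing from `weights` default to weight 1.0 (an assumption).
    """
    total = main_loss
    for name, (logits, targets) in aux_outputs.items():
        total += weights.get(name, 1.0) * mean_cross_entropy(logits, targets)
    return total


if __name__ == "__main__":
    # Hypothetical auxiliary tasks; the names are placeholders, not the paper's.
    aux = {
        "attribute": ([[2.0, 0.5, -1.0]], [0]),
        "relation": ([[0.1, 0.3]], [1]),
        "related_object": ([[1.0, 1.0, 1.0, 1.0]], [2]),
    }
    print(language_assisted_loss(1.2, aux, {"attribute": 0.5, "relation": 0.5}))
```

A weighted sum keeps the auxiliary heads lightweight: they only add gradient signal to the shared 3D features during training and can be dropped at inference time.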