Understanding 3D scenes from multi-view inputs has been proven to alleviate the view discrepancy issue in 3D visual grounding. However, existing methods typically neglect the view cues embedded in the text modality and fail to weigh the relative importance of different views. In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding that explores how to grasp view knowledge from both the text and 3D modalities. For the text branch, ViewRefer leverages the diverse linguistic knowledge of large-scale language models, e.g., GPT, to expand a single grounding text into multiple geometry-consistent descriptions. Meanwhile, in the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views. On top of that, we further present a set of learnable multi-view prototypes, which memorize scene-agnostic knowledge for different views and enhance the framework from two perspectives: a view-guided attention module for more robust text features, and a view-guided scoring strategy during the final prediction. With our designed paradigm, ViewRefer achieves superior performance on three benchmarks, surpassing the second-best method by +2.8%, +1.2%, and +0.73% on Sr3D, Nr3D, and ScanRefer, respectively. Code will be released at https://github.com/ZiyuGuo99/ViewRefer3D.
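To make the view-guided scoring idea concrete, below is a minimal sketch, not the paper's implementation, of how a set of learnable, scene-agnostic multi-view prototypes could reweight per-view grounding scores. The module name `ViewGuidedScoring`, the tensor shapes, and the similarity-softmax weighting are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewGuidedScoring(nn.Module):
    """Hypothetical sketch: weight per-view scores by text-prototype similarity."""

    def __init__(self, num_views: int, dim: int):
        super().__init__()
        # One learnable, scene-agnostic prototype embedding per view.
        self.prototypes = nn.Parameter(torch.randn(num_views, dim))

    def forward(self, text_feat: torch.Tensor, view_scores: torch.Tensor) -> torch.Tensor:
        # text_feat:   (B, dim)            sentence-level text feature
        # view_scores: (B, num_views, N)   per-view matching scores over N candidate objects
        # returns:     (B, N)              fused scores after view-guided weighting
        weights = F.softmax(text_feat @ self.prototypes.t(), dim=-1)  # (B, num_views)
        return (weights.unsqueeze(-1) * view_scores).sum(dim=1)       # (B, N)

# Usage with dummy tensors: 2 scenes, 4 views, 30 candidate objects.
scorer = ViewGuidedScoring(num_views=4, dim=256)
fused = scorer(torch.randn(2, 256), torch.randn(2, 4, 30))
print(fused.shape)  # torch.Size([2, 30])
```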