Understanding 3D scenes from multi-view inputs has been proven to alleviate the view discrepancy issue in 3D visual grounding. However, existing methods typically neglect the view cues embedded in the text modality and fail to weigh the relative importance of different views. In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding that explores how to grasp view knowledge from both the text and 3D modalities. For the text branch, ViewRefer leverages the diverse linguistic knowledge of large-scale language models, e.g., GPT, to expand a single grounding text into multiple geometry-consistent descriptions. Meanwhile, in the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views. On top of that, we further present a set of learnable multi-view prototypes, which memorize scene-agnostic knowledge for different views and enhance the framework from two perspectives: a view-guided attention module for more robust text features, and a view-guided scoring strategy during the final prediction. With this paradigm, ViewRefer achieves superior performance on three benchmarks, surpassing the second-best method by +2.8%, +1.2%, and +0.73% on Sr3D, Nr3D, and ScanRefer, respectively. Code will be released at https://github.com/ZiyuGuo99/ViewRefer3D.
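To make the inter-view attention idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: object features from several views of the same scene attend to one another so each view can borrow cues from the others. The module name `InterViewFusion`, the tensor shapes, and the hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of inter-view attention fusion (shapes and names assumed).
import torch
import torch.nn as nn


class InterViewFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (B, V, N, C) -- B scenes, V views, N object proposals, C channels
        B, V, N, C = obj_feats.shape
        # Flatten views so every object token can attend to tokens from all views.
        x = obj_feats.reshape(B, V * N, C)
        fused, _ = self.attn(x, x, x)
        x = self.norm(x + fused)  # residual connection + layer norm, standard transformer style
        return x.reshape(B, V, N, C)


if __name__ == "__main__":
    feats = torch.randn(2, 4, 32, 256)     # 2 scenes, 4 views, 32 proposals each
    print(InterViewFusion()(feats).shape)  # torch.Size([2, 4, 32, 256])
```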