3D visual grounding aims to locate the object in a point cloud that is mentioned by a free-form natural language description rich in semantic cues. However, existing methods either extract sentence-level features that couple all words together or focus mainly on object names, thereby losing word-level information or neglecting other attributes. To alleviate these issues, we present EDA, which Explicitly Decouples the textual attributes in a sentence and conducts Dense Alignment between such fine-grained language and point cloud objects. Specifically, we first propose a text decoupling module to produce textual features for every semantic component. Then, we design two losses to supervise the dense matching between the two modalities: a position alignment loss and a semantic alignment loss. On top of that, we further introduce a new visual grounding task, locating objects without mentioning their names, which thoroughly evaluates the model's dense alignment capacity. Through experiments, we achieve state-of-the-art performance on two widely adopted 3D visual grounding datasets, ScanRefer and SR3D/NR3D, and lead by a clear margin on our newly proposed task. The source code will be available at https://github.com/yanmin-wu/EDA.
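To make the idea of dense alignment concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of a semantic alignment loss between decoupled text-component features and candidate point-cloud object features, framed as a contrastive-style matching objective. All names, shapes, and the temperature value are assumptions introduced for illustration only.

```python
# Hypothetical sketch of a semantic alignment loss; tensor names, shapes,
# and the temperature are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def semantic_alignment_loss(text_feats, obj_feats, match_labels, temperature=0.07):
    """
    text_feats:   (num_components, dim)  features of decoupled semantic components
    obj_feats:    (num_objects, dim)     features of candidate point-cloud objects
    match_labels: (num_components,)      index of the ground-truth object per component
    """
    # Normalize so that the dot product is cosine similarity.
    text_feats = F.normalize(text_feats, dim=-1)
    obj_feats = F.normalize(obj_feats, dim=-1)

    # Dense similarity between every text component and every object candidate.
    logits = text_feats @ obj_feats.t() / temperature  # (num_components, num_objects)

    # Each decoupled component should align with its ground-truth object.
    return F.cross_entropy(logits, match_labels)

# Example usage with random features (5 text components, 8 object candidates).
text_feats = torch.randn(5, 256)
obj_feats = torch.randn(8, 256)
match_labels = torch.randint(0, 8, (5,))
loss = semantic_alignment_loss(text_feats, obj_feats, match_labels)
```

Under this sketch, each decoupled semantic component (not only the object name) is supervised to match its corresponding object, which is the intuition behind evaluating grounding without object names.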