Grounding referring expressions in RGBD images is an emerging field. We present a novel task of 3D visual grounding in single-view RGBD images, where the referred objects are often only partially scanned due to occlusion. In contrast to previous works that directly generate object proposals for grounding in the 3D scene, we propose a bottom-up approach that gradually aggregates context-aware information, effectively addressing the challenge posed by partial geometry. Our approach first fuses the language and visual features at the bottom level to generate a heatmap that coarsely localizes the relevant regions in the RGBD image. It then conducts adaptive feature learning based on the heatmap and performs object-level matching with another visio-linguistic fusion to finally ground the referred object. We evaluate the proposed method against state-of-the-art methods on both the RGBD images extracted from the ScanRefer dataset and our newly collected SUNRefer dataset. Experiments show that our method outperforms previous methods by a large margin (11.2% and 15.6% Acc@0.5, respectively) on the two datasets.
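To make the described two-stage pipeline concrete, below is a minimal PyTorch-style sketch of the bottom-up idea: a first visio-linguistic fusion that scores every point to form a relevance heatmap, followed by an object-level fusion that scores candidate proposals. All module names, feature dimensions, and the simple concatenation-based fusion are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class BottomUpGroundingSketch(nn.Module):
    """Hypothetical sketch of the bottom-up grounding pipeline.

    Stage 1 fuses per-point visual features with the sentence feature and
    predicts a relevance heatmap over the single-view point cloud.
    Stage 2 fuses (heatmap-guided) object proposal features with the same
    sentence feature and scores each candidate for final grounding.
    """

    def __init__(self, vis_dim: int = 128, lang_dim: int = 256, hid: int = 128):
        super().__init__()
        # Stage 1: point-level visio-linguistic fusion -> heatmap.
        self.fuse_points = nn.Sequential(nn.Linear(vis_dim + lang_dim, hid), nn.ReLU())
        self.heatmap_head = nn.Linear(hid, 1)
        # Stage 2: object-level visio-linguistic fusion -> matching score.
        self.fuse_objects = nn.Sequential(nn.Linear(vis_dim + lang_dim, hid), nn.ReLU())
        self.match_head = nn.Linear(hid, 1)

    def forward(self, point_feats, lang_feat, proposal_feats):
        # point_feats:    (N, vis_dim)  per-point features from the RGBD input
        # lang_feat:      (lang_dim,)   embedding of the referring expression
        # proposal_feats: (M, vis_dim)  features of candidate objects, assumed to be
        #                               aggregated from heatmap-weighted regions
        n_points = point_feats.size(0)
        lang_per_point = lang_feat.unsqueeze(0).expand(n_points, -1)
        heatmap = torch.sigmoid(
            self.heatmap_head(self.fuse_points(torch.cat([point_feats, lang_per_point], dim=-1)))
        ).squeeze(-1)  # (N,) coarse localization of language-relevant regions

        n_props = proposal_feats.size(0)
        lang_per_obj = lang_feat.unsqueeze(0).expand(n_props, -1)
        scores = self.match_head(
            self.fuse_objects(torch.cat([proposal_feats, lang_per_obj], dim=-1))
        ).squeeze(-1)  # (M,) matching score; argmax gives the grounded object
        return heatmap, scores
```

In practice the referred object would be taken as the proposal with the highest matching score, with the heatmap serving only to guide where proposal features are aggregated.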