The 3D visual grounding task has been explored with visual and language streams that comprehend referential language to identify target objects in 3D scenes. However, most existing methods devote the visual stream to capturing 3D visual clues using off-the-shelf point cloud encoders. The main question we address in this paper is "can we consolidate the 3D visual stream with 2D clues synthesized from point clouds and efficiently utilize them in training and testing?". The main idea is to assist the 3D encoder by incorporating rich 2D object representations without requiring extra 2D inputs. To this end, we leverage 2D clues, synthetically generated from 3D point clouds, and empirically show their ability to boost the quality of the learned visual representations. We validate our approach through comprehensive experiments on the Nr3D, Sr3D, and ScanRefer datasets and show consistent performance gains compared to existing methods. Our proposed module, dubbed Look Around and Refer (LAR), significantly outperforms state-of-the-art 3D visual grounding techniques on these three benchmarks. The code is available at https://eslambakr.github.io/LAR.github.io/.
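To make the core idea concrete, the sketch below illustrates one way 2D clues could be synthesized from a 3D point cloud: a toy pinhole projection of a colored object point cloud onto a virtual image plane. This is only a minimal illustration under assumed conventions (numpy, a single viewpoint, a z-buffer to keep the nearest point per pixel); it is not the paper's actual multi-view rendering pipeline, and all names and parameters here are hypothetical.

```python
import numpy as np

def render_synthetic_view(points, colors, cam_pos, look_at,
                          image_size=128, focal=100.0):
    """Project a colored point cloud onto a 2D image from a virtual camera.

    Toy pinhole projection with a z-buffer; illustrative only, not the
    authors' rendering method.
    """
    # Build a camera frame looking from cam_pos toward look_at.
    forward = look_at - cam_pos
    forward /= np.linalg.norm(forward)
    up = np.array([0.0, 0.0, 1.0])
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)

    # Transform points into camera coordinates.
    rel = points - cam_pos
    x = rel @ right
    y = rel @ true_up
    z = rel @ forward

    # Keep points in front of the camera and project with a pinhole model.
    valid = z > 1e-3
    u = (focal * x[valid] / z[valid] + image_size / 2).astype(int)
    v = (focal * y[valid] / z[valid] + image_size / 2).astype(int)
    in_frame = (u >= 0) & (u < image_size) & (v >= 0) & (v < image_size)

    image = np.zeros((image_size, image_size, 3), dtype=np.float32)
    zbuf = np.full((image_size, image_size), np.inf)
    for ui, vi, zi, ci in zip(u[in_frame], v[in_frame],
                              z[valid][in_frame], colors[valid][in_frame]):
        if zi < zbuf[vi, ui]:  # keep the nearest point per pixel
            zbuf[vi, ui] = zi
            image[vi, ui] = ci
    return image

# Usage example with placeholder data: render one object's points from a
# viewpoint above and to the side of the object's centroid.
pts = np.random.rand(2048, 3)   # placeholder object point cloud (x, y, z)
rgb = np.random.rand(2048, 3)   # placeholder per-point colors
view = render_synthetic_view(pts, rgb,
                             cam_pos=np.array([2.0, 2.0, 2.0]),
                             look_at=pts.mean(axis=0))
```

Such synthesized views could then be fed to a standard 2D encoder alongside the 3D point-cloud encoder, which is the spirit of using 2D clues "without requiring extra 2D inputs."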