This paper tackles an emerging and challenging vision-language task, namely 3D visual grounding on point clouds. Many recent works benefit from the Transformer architecture with its well-known attention mechanism, which has led to tremendous breakthroughs on this task. However, we find that these achievements rely on various pre-training schemes or multi-stage processing. To simplify the pipeline, we carefully investigate 3D visual grounding and summarize three fundamental problems concerning how to develop a high-performance end-to-end model for this task. To address these problems, we introduce a novel Hierarchical Attention Model (HAM), offering multi-granularity representation and efficient augmentation for both the given texts and the multi-modal visual inputs. Extensive experimental results demonstrate the superiority of the proposed HAM model. Specifically, HAM ranks first on the large-scale ScanRefer challenge, outperforming all existing methods by a significant margin. Code will be released after acceptance.
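Since the abstract only names the hierarchical, multi-granularity attention mechanism without detailing it, the following is a minimal sketch of what a two-granularity (word-level and sentence-level) cross-modal attention block could look like. The module name, dimensions, pooling choice, and fusion scheme are all illustrative assumptions, not the paper's actual HAM implementation.

```python
# Minimal sketch of a two-granularity cross-modal attention block,
# loosely in the spirit of the hierarchical attention named above.
# All names, shapes, and design choices here are ASSUMPTIONS for
# illustration; they are not taken from the HAM paper.
import torch
import torch.nn as nn

class HierarchicalCrossAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Fine granularity: every word token attends to the scene features.
        self.fine_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Coarse granularity: one pooled sentence vector attends to the scene.
        self.coarse_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, text_tokens: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_tokens:  (B, L, d) word-level text embeddings
        # visual_feats: (B, N, d) point/object-level visual features
        fine, _ = self.fine_attn(text_tokens, visual_feats, visual_feats)    # (B, L, d)
        sentence = text_tokens.mean(dim=1, keepdim=True)                     # (B, 1, d)
        coarse, _ = self.coarse_attn(sentence, visual_feats, visual_feats)   # (B, 1, d)
        # Broadcast the coarse summary to every word and fuse both granularities.
        coarse = coarse.expand(-1, fine.size(1), -1)
        return self.fuse(torch.cat([fine, coarse], dim=-1))                  # (B, L, d)

# Usage: grounding features for 2 sentences over 1024 scene features.
if __name__ == "__main__":
    block = HierarchicalCrossAttention()
    out = block(torch.randn(2, 20, 256), torch.randn(2, 1024, 256))
    print(out.shape)  # torch.Size([2, 20, 256])
```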