In this paper, we propose a transformer-based approach for visual grounding. Unlike previous propose-and-rank frameworks that rely heavily on pretrained object detectors, or proposal-free frameworks that upgrade an off-the-shelf one-stage detector by fusing textual embeddings, our approach is built on top of a transformer encoder-decoder and is independent of any pretrained detectors or word embedding models. Termed VGTR -- Visual Grounding with TRansformers, our approach is designed to learn semantic-discriminative visual features under the guidance of the textual description without harming their localization ability. This information flow gives VGTR a strong capability to capture context-level semantics of both the visual and linguistic modalities, enabling it to aggregate the accurate visual clues implied by the description and locate the object instance of interest. Experiments show that our method outperforms state-of-the-art proposal-free approaches by a considerable margin on five benchmarks while maintaining fast inference speed.
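To give a concrete picture of the proposal-free, detector-free setup described above, the PyTorch sketch below wires a convolutional stem, learned word embeddings, and a standard transformer encoder-decoder into a single-box regressor. It is an illustrative approximation under stated assumptions, not the VGTR implementation; all names (VGTRSketch, d_model, box_head) and design details such as the 16x16 patch stem and mean pooling are hypothetical.

```python
# Minimal sketch (not the authors' released code) of a proposal-free,
# transformer-based visual grounding model: image patches and word tokens
# are embedded, fused by a transformer encoder-decoder, and a small MLP
# regresses a single bounding box. All module names and hyperparameters
# here are illustrative assumptions.
import torch
import torch.nn as nn


class VGTRSketch(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        # Simple convolutional stem standing in for a visual backbone.
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Learned word embeddings instead of a pretrained language model.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        # Regress one (cx, cy, w, h) box, normalized to [0, 1].
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),
        )

    def forward(self, image, token_ids):
        # image: (B, 3, H, W); token_ids: (B, L)
        feats = self.backbone(image)                    # (B, C, H/16, W/16)
        feats = feats.flatten(2).transpose(1, 2)        # (B, N, C) visual tokens
        words = self.text_embed(token_ids)              # (B, L, C)
        # Visual tokens attend to the expression via the decoder's cross-attention.
        fused = self.transformer(src=words, tgt=feats)  # (B, N, C)
        return self.box_head(fused.mean(dim=1))         # (B, 4)


if __name__ == "__main__":
    model = VGTRSketch()
    box = model(torch.randn(2, 3, 256, 256), torch.randint(0, 10000, (2, 12)))
    print(box.shape)  # torch.Size([2, 4])
```

In this sketch the text stream plays the role of the encoder input and the visual tokens are decoded against it, so the grounding signal flows from language to vision through cross-attention; a training objective (e.g. L1 plus GIoU loss on the predicted box) would be added on top.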