Current one-stage methods for visual grounding encode the language query as a single holistic sentence embedding before fusing it with visual features. Such a formulation does not treat each word of the query sentence equally when modeling language-to-vision attention, and is therefore prone to neglecting words that contribute little to the sentence embedding but are critical for visual grounding. In this paper we propose Word2Pix: a one-stage visual grounding network based on an encoder-decoder transformer architecture that learns textual-to-visual feature correspondence via word-to-pixel attention. The embedding of each word in the query sentence is treated alike, attending to visual pixels individually rather than being collapsed into a single holistic sentence embedding. In this way, every word is given an equal opportunity to steer the language-to-vision attention towards the referent target through multiple stacked transformer decoder layers. We conduct experiments on the RefCOCO, RefCOCO+ and RefCOCOg datasets, and the proposed Word2Pix outperforms existing one-stage methods by a notable margin. The results also show that Word2Pix surpasses two-stage visual grounding models while keeping the merits of the one-stage paradigm, namely end-to-end training and real-time inference speed, intact.
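To make the word-to-pixel attention idea concrete, the following is a minimal sketch (not the authors' released code) of one transformer decoder layer in which each word embedding individually attends over the flattened visual feature map; the class name, dimensions, and layer arrangement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Word2PixDecoderLayer(nn.Module):
    """Illustrative word-to-pixel cross-attention layer (a sketch, not the paper's exact code).

    Each word embedding acts as a query that attends over the flattened visual
    feature map ("pixels"), so every word individually shapes the
    language-to-vision attention instead of a single sentence embedding.
    """
    def __init__(self, d_model=256, nhead=8, dim_ff=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)   # words attend to words
        self.cross_attn = nn.MultiheadAttention(d_model, nhead)  # words attend to pixels
        self.ffn = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, words, pixels):
        # words:  (num_words, batch, d_model) per-word query embeddings
        # pixels: (H*W, batch, d_model) flattened visual features from the encoder
        w = self.norm1(words + self.self_attn(words, words, words)[0])
        w = self.norm2(w + self.cross_attn(w, pixels, pixels)[0])  # word-to-pixel attention
        return self.norm3(w + self.ffn(w))

# Usage sketch: 10 words, batch of 2, a 20x20 visual feature map projected to d_model=256.
layer = Word2PixDecoderLayer()
words = torch.randn(10, 2, 256)
pixels = torch.randn(400, 2, 256)
out = layer(words, pixels)   # (10, 2, 256): refined per-word features
```

Stacking several such layers lets each word repeatedly refine its attention over the pixel grid, which is the behavior the abstract contrasts with single-sentence-embedding fusion.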