Current one-stage methods for visual grounding encode the language query as a single holistic sentence embedding before fusing it with visual features. Such a formulation does not treat each word of the query sentence equally when modeling language-to-vision attention, and is therefore prone to neglecting words that contribute little to the sentence embedding but are critical for visual grounding. In this paper we propose Word2Pix: a one-stage visual grounding network based on an encoder-decoder transformer architecture that learns textual-to-visual feature correspondence via word-to-pixel attention. The embedding of each word in the query sentence is treated alike, attending to visual pixels individually rather than being collapsed into a single holistic sentence embedding. In this way, every word is given an equal opportunity to steer the language-to-vision attention towards the referent target through multiple stacked transformer decoder layers. We conduct experiments on the RefCOCO, RefCOCO+ and RefCOCOg datasets, and the proposed Word2Pix outperforms existing one-stage methods by a notable margin. The results also show that Word2Pix surpasses two-stage visual grounding models while keeping the merits of the one-stage paradigm, namely end-to-end training and real-time inference speed, intact.
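To make the word-to-pixel attention idea concrete, the following is a minimal sketch (not the authors' released code) of one transformer decoder layer in which each word embedding individually attends over the flattened visual feature map; the class name, dimensions, and layer arrangement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Word2PixDecoderLayer(nn.Module):
    """Illustrative word-to-pixel cross-attention layer (a sketch, not the paper's exact code).

    Each word embedding acts as a query that attends over the flattened visual
    feature map ("pixels"), so every word individually shapes the
    language-to-vision attention instead of a single sentence embedding.
    """
    def __init__(self, d_model=256, nhead=8, dim_ff=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)   # words attend to words
        self.cross_attn = nn.MultiheadAttention(d_model, nhead)  # words attend to pixels
        self.ffn = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, words, pixels):
        # words:  (num_words, batch, d_model) per-word query embeddings
        # pixels: (H*W, batch, d_model) flattened visual features from the encoder
        w = self.norm1(words + self.self_attn(words, words, words)[0])
        w = self.norm2(w + self.cross_attn(w, pixels, pixels)[0])  # word-to-pixel attention
        return self.norm3(w + self.ffn(w))

# Usage sketch: 10 words, batch of 2, a 20x20 visual feature map projected to d_model=256.
layer = Word2PixDecoderLayer()
words = torch.randn(10, 2, 256)
pixels = torch.randn(400, 2, 256)
out = layer(words, pixels)   # (10, 2, 256): refined per-word features
```

Stacking several such layers lets each word repeatedly refine its attention over the pixel grid, which is the behavior the abstract contrasts with single-sentence-embedding fusion.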