Discrete-continuous Action Space Policy Gradient-based Attention for Image-Text Matching

Image-text matching is an important multi-modal task with many applications. It aims to match images and text that carry similar semantic information. Existing approaches do not explicitly transform the different modalities into a common space. Meanwhile, the attention mechanism widely used in image-text matching models is trained without supervision. We propose a novel attention scheme that projects the image and text embeddings into a common space and optimises the attention weights directly towards the evaluation metrics. The proposed scheme can be regarded as a kind of supervised attention that requires no additional annotations. It is trained via a novel discrete-continuous action space policy gradient algorithm, which is more effective at modelling complex action spaces than previous continuous action space policy gradient methods. We evaluate the proposed method on two widely used benchmark datasets, Flickr30k and MS-COCO, where it outperforms previous approaches by a large margin.
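
The idea of attending across modalities in a common space can be illustrated with a small sketch. The abstract does not describe the architecture, so everything below is an assumption made for illustration: the linear projections, the cosine-based attention, the function names and the toy dimensions are hypothetical, and the policy gradient training step is left out entirely. It is a generic cross-modal attention example, not the authors' scheme.

```python
import math
import random


def project(vectors, weight):
    """Linearly map each vector into the common space (weight is d_in x d_common)."""
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(len(weight[0]))]
        for v in vectors
    ]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attended_similarity(regions, words, w_img, w_txt):
    """Score one image-text pair: each word attends over regions in the common space."""
    r = project(regions, w_img)
    t = project(words, w_txt)
    word_scores = []
    for word in t:
        # Attention weights of this word over all image regions.
        weights = softmax([cosine(word, region) for region in r])
        # Attention-weighted sum of region vectors: the visual context for this word.
        context = [
            sum(w * region[j] for w, region in zip(weights, r))
            for j in range(len(word))
        ]
        word_scores.append(cosine(word, context))
    # Average the per-word scores into a single image-text similarity.
    return sum(word_scores) / len(word_scores)


def random_matrix(rows, cols, rng):
    """Toy stand-in for learned weights or extracted features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
w_img = random_matrix(6, 4, rng)    # hypothetical image projection: 6-d -> 4-d
w_txt = random_matrix(5, 4, rng)    # hypothetical text projection: 5-d -> 4-d
regions = random_matrix(3, 6, rng)  # 3 image region features
words = random_matrix(2, 5, rng)    # 2 word features
score = attended_similarity(regions, words, w_img, w_txt)
print(f"similarity: {score:.3f}")
```

In a trained model the two projection matrices would be learned. According to the abstract, the proposed scheme goes further and optimises the attention weights themselves against the evaluation metrics, which this sketch does not attempt.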


Related content

Attention mechanism


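The attention mechanism was first proposed in the field of computer vision, but it really took off with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which applied attention to an RNN model for image classification. Subsequently, Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate" [1], used an attention-like mechanism to perform translation and alignment jointly in machine translation; their work is regarded as the first to apply attention to NLP. Attention-based extensions of RNN models were then applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the rough trend of attention research.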
