Image-text matching is an important multi-modal task with a wide range of applications. It aims to match images and texts that share similar semantic information. Existing approaches do not explicitly transform the different modalities into a common space. Meanwhile, the attention mechanism widely used in image-text matching models lacks supervision. We propose a novel attention scheme that projects the image and text embeddings into a common space and optimises the attention weights directly towards the evaluation metrics. The proposed attention scheme can be considered a kind of supervised attention that requires no additional annotations. It is trained via a novel discrete-continuous action space policy gradient algorithm, which models complex action spaces more effectively than previous continuous action space policy gradients. We evaluate the proposed method on two widely used benchmark datasets, Flickr30k and MS-COCO, outperforming previous approaches by a large margin.
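The core idea of projecting both modalities into a common space and computing an attention-weighted matching score can be illustrated with a minimal sketch. All dimensions, the random projection matrices, and the softmax-over-regions attention below are illustrative assumptions, not the paper's actual architecture or learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes: image regions and text tokens start in
# different spaces and are projected into one shared embedding space.
d_img, d_txt, d_common = 2048, 300, 256
n_regions, n_tokens = 36, 12

# Projection matrices (randomly initialised here; learned in practice).
W_img = rng.normal(scale=0.01, size=(d_img, d_common))
W_txt = rng.normal(scale=0.01, size=(d_txt, d_common))

def project(features, W):
    """Map modality-specific features into the common space, L2-normalised."""
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def matching_score(img_feats, txt_feats):
    """Attention-weighted similarity between one image and one sentence."""
    v = project(img_feats, W_img)          # (n_regions, d_common)
    t = project(txt_feats, W_txt)          # (n_tokens, d_common)
    sim = t @ v.T                          # token-region cosine similarities
    # Each token attends over image regions (softmax along the region axis).
    attn = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    attended = attn @ v                    # region features aggregated per token
    # Average token-level cosine similarity serves as the matching score.
    return float((attended * t).sum(axis=1).mean())

img = rng.normal(size=(n_regions, d_img))
txt = rng.normal(size=(n_tokens, d_txt))
score = matching_score(img, txt)
```

In the paper's scheme these attention weights are not left unsupervised as above, but are optimised directly towards the evaluation metrics via the proposed discrete-continuous policy gradient.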