This paper studies the task of image-sentence matching, where learning appropriate representations across multi-modal data is the main challenge. Unlike previous approaches that predominantly deploy symmetric architectures to represent both modalities, we propose the Saliency-guided Attention Network (SAN), which asymmetrically employs visual and textual attention modules to learn the fine-grained correlation intertwined between vision and language. The proposed SAN mainly consists of three components: a saliency detector, a Saliency-weighted Visual Attention (SVA) module, and a Saliency-guided Textual Attention (STA) module. Concretely, the saliency detector provides visual saliency information as guidance for the two attention modules. SVA is designed to leverage this saliency information to improve the discriminative power of the visual representations. By fusing the visual information from SVA with the textual information as a multi-modal guidance, STA learns discriminative textual representations that are highly sensitive to visual clues. Extensive experiments demonstrate that SAN improves the state-of-the-art results on the benchmark Flickr30K and MSCOCO datasets by a large margin.
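To make the asymmetric flow concrete, the following is a minimal, hypothetical sketch of how a saliency-weighted visual attention step could feed a saliency-guided textual attention step, assuming region-level image features, word-level sentence features, and a per-region saliency map from a saliency detector. All module names, dimensions, and the cosine-similarity matching score are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SANSketch(nn.Module):
    """Illustrative sketch of the SVA -> STA flow described in the abstract."""

    def __init__(self, vis_dim=2048, txt_dim=1024, joint_dim=1024):
        super().__init__()
        # Project region features and word features into a joint space.
        self.vis_proj = nn.Linear(vis_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)
        # Attention scorers for the (assumed) SVA and STA modules.
        self.sva_score = nn.Linear(joint_dim, 1)
        self.sta_score = nn.Linear(2 * joint_dim, 1)

    def forward(self, regions, saliency, words):
        # regions:  (B, R, vis_dim) region-level image features
        # saliency: (B, R) saliency weight per region from a saliency detector
        # words:    (B, T, txt_dim) word-level sentence features
        v = self.vis_proj(regions)                               # (B, R, D)
        t = self.txt_proj(words)                                 # (B, T, D)

        # SVA (sketch): combine learned attention with the saliency prior so
        # that salient regions dominate the pooled visual representation.
        att_v = self.sva_score(torch.tanh(v)).squeeze(-1)        # (B, R)
        att_v = F.softmax(att_v + torch.log(saliency + 1e-6), dim=1)
        v_global = torch.bmm(att_v.unsqueeze(1), v).squeeze(1)   # (B, D)

        # STA (sketch): use the saliency-weighted visual vector as a
        # multi-modal guidance when attending over the words.
        guide = v_global.unsqueeze(1).expand_as(t)               # (B, T, D)
        att_t = self.sta_score(torch.tanh(torch.cat([t, guide], dim=-1)))
        att_t = F.softmax(att_t.squeeze(-1), dim=1)              # (B, T)
        t_global = torch.bmm(att_t.unsqueeze(1), t).squeeze(1)   # (B, D)

        # Assumed matching score: cosine similarity between the two vectors.
        return F.cosine_similarity(v_global, t_global, dim=-1)
```

The sketch only shows why the architecture is asymmetric: the visual branch is reweighted by saliency alone, while the textual branch is attended under guidance from the already saliency-weighted visual representation.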