Step-Wise Hierarchical Alignment Network for Image-Text Matching

Image-text matching plays a central role in bridging the semantic gap between vision and language. Precise visual-semantic alignment depends on capturing the fine-grained cross-modal correspondence between image and text. Most previous methods rely on single-step reasoning to discover visual-semantic interactions, and therefore cannot exploit multi-level information to locate hierarchical fine-grained relevance. In contrast, we propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process. Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and then global-to-global alignment at the context level. This progressive alignment strategy supplies our model with more complementary and sufficient semantic clues for understanding the hierarchical correlations between image and text. Experimental results on two benchmark datasets demonstrate the superiority of the proposed method.
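The three alignment steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the mean pooling used to build global vectors, the direction of the global-to-local step (global image vector against individual words), and the equal-weight averaging of the three scores are all assumptions made for illustration. Image regions and words are assumed to be already embedded in a shared vector space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def mean_pool(vectors):
    """Average a list of vectors into one context-level vector (assumed pooling)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def local_to_local(regions, words):
    """Fragment level: match each word to its most similar region, then average."""
    return sum(max(cosine(w, r) for r in regions) for w in words) / len(words)

def global_to_local(regions, words):
    """Context level, first step: compare the global image vector with each word."""
    g_img = mean_pool(regions)
    return sum(cosine(g_img, w) for w in words) / len(words)

def global_to_global(regions, words):
    """Context level, second step: compare global image and global sentence vectors."""
    return cosine(mean_pool(regions), mean_pool(words))

def stepwise_score(regions, words):
    """Run the three steps in order and average them (equal weights are an assumption)."""
    steps = (local_to_local, global_to_local, global_to_global)
    return sum(step(regions, words) for step in steps) / len(steps)

if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]   # two hypothetical region embeddings
    caption = [[1.0, 0.0]]               # one hypothetical word embedding
    print(stepwise_score(regions, caption))
```

In this sketch a caption whose words point in the same direction as some image region scores higher than one whose words point away from every region, which is the behavior a matching score should have. A learned model would replace the fixed pooling and averaging with trained attention and fusion layers.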

