Image-text matching plays a central role in bridging the semantic gap between vision and language. The key to achieving precise visual-semantic alignment lies in capturing the fine-grained cross-modal correspondence between image and text. Most previous methods rely on single-step reasoning to discover visual-semantic interactions, which lacks the ability to exploit multi-level information and locate hierarchical fine-grained relevance. In contrast, in this work we propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process. Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and global-to-global alignment performed sequentially at the context level. This progressive alignment strategy supplies our model with more complementary and sufficient semantic clues for understanding the hierarchical correlations between image and text. Experimental results on two benchmark datasets demonstrate the superiority of our proposed method.
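To make the three-step alignment concrete, the following is a minimal sketch of how such a step-wise score could be computed. It is not the authors' implementation: SHAN's actual alignment modules use learned attention networks, whereas here mean pooling stands in for the global context encoders, cosine attention stands in for the learned cross-modal attention, and the uniform weighting of the three step scores is an assumption. The function and parameter names (`shan_style_score`, `temperature`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def cosine_sim(a, b, eps=1e-8):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = F.normalize(a, dim=-1, eps=eps)
    b = F.normalize(b, dim=-1, eps=eps)
    return a @ b.t()

def attend(queries, contexts, temperature=4.0):
    """Softly attend each query vector over the context vectors."""
    attn = F.softmax(cosine_sim(queries, contexts) * temperature, dim=-1)
    return attn @ contexts  # one attended context vector per query

def shan_style_score(regions, words, temperature=4.0):
    """
    Hypothetical three-step matching score in the spirit of SHAN.
    regions: (n_regions, d) image region features
    words:   (n_words, d)  word features
    """
    # Step 1: local-to-local alignment at the fragment level --
    # each word gathers its most relevant visual context.
    attended_regions = attend(words, regions, temperature)
    local_score = cosine_sim(words, attended_regions).diag().mean()

    # Global context features (mean pooling as a simple stand-in
    # for the learned context encoders).
    img_global = regions.mean(dim=0, keepdim=True)
    txt_global = words.mean(dim=0, keepdim=True)

    # Step 2: global-to-local alignment at the context level --
    # the global text attends over regions, the global image over words.
    g2l_img = cosine_sim(txt_global, attend(txt_global, regions, temperature))
    g2l_txt = cosine_sim(img_global, attend(img_global, words, temperature))
    g2l_score = (g2l_img.mean() + g2l_txt.mean()) / 2

    # Step 3: global-to-global alignment.
    g2g_score = cosine_sim(img_global, txt_global).mean()

    # Aggregate the step-wise evidence (uniform weights assumed here).
    return (local_score + g2l_score + g2g_score) / 3
```

Each step contributes complementary evidence: the fragment-level term rewards sentences whose individual words find matching regions, while the two context-level terms check that the image and text agree as wholes, which is the progressive refinement the abstract describes.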