Image-text retrieval (ITR) is a challenging task in multimodal information processing due to the semantic gap between modalities. In recent years, researchers have made great progress in exploring accurate alignment between images and text. However, existing works mainly focus on fine-grained alignment between image regions and sentence fragments, ignoring the guidance offered by global contextual information. In fact, integrating local fine-grained information with global contextual information can provide richer semantic clues for retrieval. In this paper, we propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval. First, to capture comprehensive multimodal features, we construct feature graphs for the image and text modalities, respectively. Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module, which strengthens the semantic correspondence between local and global information and yields more accurate feature representations for the image and text modalities. Finally, the resulting image and text features are refined through three-level similarity functions to achieve hierarchical alignment. To validate the proposed model, we perform extensive experiments on the MS-COCO and Flickr30K datasets. Experimental results show that HGAN outperforms state-of-the-art methods on both datasets, demonstrating its effectiveness and superiority.
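To make the three-level alignment idea concrete, the sketch below combines global-global, local-global, and local-local similarity scores into a single image-text score matrix. It is a minimal illustration only: the function names (`hierarchical_similarity`, `cosine_sim`), tensor shapes, mean-pooling of regions, and equal default weights are all assumptions for exposition, and it does not reproduce the paper's feature graphs or MFAR module.

```python
import torch
import torch.nn.functional as F


def cosine_sim(a, b):
    """Pairwise cosine similarity: (B, D) x (B, D) -> (B, B)."""
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()


def hierarchical_similarity(img_global, txt_global, img_local, txt_local,
                            weights=(1.0, 1.0, 1.0)):
    """Combine three alignment levels into one image-text score matrix.

    img_global: (B, D)    pooled image features
    txt_global: (B, D)    pooled sentence features
    img_local:  (B, R, D) region features
    txt_local:  (B, W, D) word features
    Shapes and equal default weights are illustrative assumptions.
    """
    # Level 1: global-global alignment between pooled features.
    s_gg = cosine_sim(img_global, txt_global)             # (B, B)

    # Level 2: local-global alignment; mean-pool regions as a cheap
    # stand-in for the paper's graph-based aggregation.
    s_lg = cosine_sim(img_local.mean(dim=1), txt_global)  # (B, B)

    # Level 3: local-local alignment; for every image-text pair, match
    # each word to its best region, then average over words.
    il = F.normalize(img_local, dim=-1)                   # (B, R, D)
    tl = F.normalize(txt_local, dim=-1)                   # (B, W, D)
    sim = torch.einsum('ird,jwd->ijrw', il, tl)           # (B, B, R, W)
    s_ll = sim.max(dim=2).values.mean(dim=2)              # (B, B)

    w1, w2, w3 = weights
    return w1 * s_gg + w2 * s_lg + w3 * s_ll


if __name__ == "__main__":
    B, R, W, D = 4, 36, 12, 256
    scores = hierarchical_similarity(torch.randn(B, D), torch.randn(B, D),
                                     torch.randn(B, R, D), torch.randn(B, W, D))
    print(scores.shape)  # torch.Size([4, 4]); row i scores image i against every caption
```

Row i of the returned matrix ranks all captions for image i (and column j ranks all images for caption j), which is the standard evaluation setup on MS-COCO and Flickr30K.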