As a fundamental and challenging task in bridging the language and vision domains, Image-Text Retrieval (ITR) aims to retrieve target instances that are semantically relevant to a given query from the other modality; its key challenge is measuring semantic similarity across modalities. Although significant progress has been achieved, existing approaches typically suffer from two major limitations: (1) they hurt representation accuracy by directly exploiting bottom-up-attention-based region-level features in which every region is treated equally; (2) they limit the scale of negative sample pairs by relying on a mini-batch-based end-to-end training mechanism. To address these limitations, we propose a Unified Semantic Enhancement Momentum Contrastive Learning (USER) method for ITR. Specifically, we carefully design two simple but effective Global representation based Semantic Enhancement (GSE) modules. One learns the global representation via a self-attention algorithm, denoted the Self-Guided Enhancement (SGE) module. The other benefits from the pre-trained CLIP model, providing a novel scheme to exploit and transfer knowledge from an off-the-shelf model, denoted the CLIP-Guided Enhancement (CGE) module. Moreover, we incorporate the training mechanism of MoCo into ITR, in which two dynamic queues are employed to enrich and enlarge the pool of negative sample pairs. Meanwhile, a Unified Training Objective (UTO) is developed to learn from both mini-batch-based and dynamic-queue-based samples. Extensive experiments on the benchmark MSCOCO and Flickr30K datasets demonstrate the superiority of our method in both retrieval accuracy and inference efficiency. Our source code will be released at https://github.com/zhangy0822/USER.
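The MoCo-style training mechanism mentioned above can be illustrated with a minimal sketch of a dynamic queue that stores momentum-encoded features from past mini-batches and supplies them as extra negatives. This is a simplified NumPy illustration under stated assumptions (L2-normalized embeddings, a single queue, FIFO replacement); the names `NegativeQueue`, `enqueue`, and `contrastive_logits` are illustrative and not taken from the paper.

```python
import numpy as np

class NegativeQueue:
    """FIFO queue of momentum-encoded features used as extra negatives."""

    def __init__(self, dim, size, seed=0):
        rng = np.random.default_rng(seed)
        self.size = size
        # Initialize with random unit vectors as placeholder negatives.
        self.feats = rng.standard_normal((size, dim))
        self.feats /= np.linalg.norm(self.feats, axis=1, keepdims=True)
        self.ptr = 0

    def enqueue(self, batch_feats):
        # Replace the oldest entries with the newest mini-batch (FIFO).
        n = batch_feats.shape[0]
        idx = (self.ptr + np.arange(n)) % self.size
        self.feats[idx] = batch_feats
        self.ptr = (self.ptr + n) % self.size


def contrastive_logits(query, positive, queue, temperature=0.07):
    """InfoNCE-style logits: one positive column, then queue negatives.

    `query` and `positive` are (B, D) L2-normalized embeddings from the
    two modalities; the queue contributes K additional negatives each.
    """
    l_pos = np.sum(query * positive, axis=1, keepdims=True)  # (B, 1)
    l_neg = query @ queue.feats.T                            # (B, K)
    return np.concatenate([l_pos, l_neg], axis=1) / temperature
```

In the full method there would be one such queue per modality (image and text), and the queue would be filled by a slowly-updated momentum encoder rather than the raw batch features; this sketch only shows why the queue decouples the number of negatives from the mini-batch size.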