Image-Text Retrieval (ITR) is challenging because it must bridge the visual and linguistic modalities. Contrastive learning has been adopted by most prior work. Beyond the limited number of negative image-text pairs, the capability of contrastive learning is restricted by manually weighted negative pairs as well as by its unawareness of external knowledge. In this paper, we propose a novel Coupled Diversity-Sensitive Momentum Contrastive Learning (CODER) framework for improving cross-modal representation. First, a novel diversity-sensitive contrastive learning (DCL) architecture is introduced. We maintain dynamic dictionaries for both modalities to enlarge the pool of image-text pairs, and achieve diversity-sensitiveness through adaptive negative-pair weighting. Furthermore, CODER contains two branches. One learns instance-level embeddings from images/texts and also generates pseudo online clustering labels for its input images/texts based on those embeddings. Meanwhile, the other branch learns to query a commonsense knowledge graph to form concept-level descriptors for both modalities. Both branches then leverage DCL to align the cross-modal embedding spaces, while an extra pseudo clustering label prediction loss promotes concept-level representation learning in the second branch. Extensive experiments conducted on two popular benchmarks, i.e., MSCOCO and Flickr30K, validate that CODER remarkably outperforms the state-of-the-art approaches.
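To make the DCL idea above concrete, below is a minimal PyTorch sketch of an InfoNCE-style image-to-text loss that draws negatives from a dynamic momentum dictionary (queue) and weights them adaptively by their similarity, so that harder negatives contribute more than a hand-tuned uniform weight would. The softmax-based weighting scheme, function names, and hyperparameters are illustrative assumptions rather than the paper's exact formulation; a symmetric text-to-image term and the momentum-encoder update would be added analogously.

```python
import torch
import torch.nn.functional as F


def dcl_image_to_text_loss(img_emb, txt_emb, txt_queue, tau=0.07):
    """Weighted InfoNCE sketch (hypothetical; not the paper's exact loss).

    img_emb:   (B, D) image embeddings from the online encoder
    txt_emb:   (B, D) positive text embeddings from the momentum encoder
    txt_queue: (K, D) dynamic dictionary of past text embeddings (negatives)
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    txt_queue = F.normalize(txt_queue, dim=-1)

    pos = (img_emb * txt_emb).sum(dim=-1)        # (B,)  positive similarities
    neg = img_emb @ txt_queue.t()                # (B, K) negative similarities

    # Adaptive negative-pair weighting: more similar (harder) negatives
    # receive larger weights instead of a fixed, manually chosen weight.
    w = torch.softmax(neg / tau, dim=-1).detach()        # (B, K)
    neg_term = (w * torch.exp(neg / tau)).sum(dim=-1)    # (B,)

    pos_term = torch.exp(pos / tau)
    return -torch.log(pos_term / (pos_term + neg_term)).mean()


@torch.no_grad()
def enqueue(txt_queue, new_keys, ptr):
    """Replace the oldest dictionary entries with fresh momentum-encoder keys.
    Assumes the queue size K is a multiple of the batch size."""
    k = new_keys.size(0)
    txt_queue[ptr:ptr + k] = new_keys
    return (ptr + k) % txt_queue.size(0)


if __name__ == "__main__":
    B, D, K = 32, 256, 4096
    img, txt = torch.randn(B, D), torch.randn(B, D)
    queue, ptr = torch.randn(K, D), 0
    loss = dcl_image_to_text_loss(img, txt, queue)
    ptr = enqueue(queue, F.normalize(txt, dim=-1), ptr)
    print(loss.item())
```

The queue is what enlarges the negative set beyond a single mini-batch, while the detached softmax weights make the loss diversity-sensitive without introducing extra trainable parameters.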