In this paper, we study cross-modal image retrieval, where the input consists of a source image together with a text describing the modifications to be applied to it in order to obtain the desired image. Prior work usually tackles this task with a three-stage strategy: 1) extract the features of the inputs; 2) fuse the features of the source image and the modification text to obtain a fused feature; 3) learn a similarity metric between the desired image and the combination of the source image and the modification text via deep metric learning. Since classical image/text encoders already learn useful representations and common pair-based loss functions from distance metric learning suffice for cross-modal retrieval, most methods improve retrieval accuracy by designing new fusion networks. However, these methods do not adequately handle the modality gap caused by the inconsistent distributions and representations of features from different modalities, which strongly affects both feature fusion and similarity learning. To alleviate this problem, we adopt the contrastive self-supervised learning method Deep InfoMax (DIM) in our approach to bridge this gap by strengthening the dependence between the text, the image, and their fusion. Specifically, our method narrows the gap between the text modality and the image modality by maximizing the mutual information between their representations, which are not exactly semantically identical. Moreover, we seek an effective common subspace for the semantically equivalent fused feature and the desired image's feature by applying Deep InfoMax between a low-level layer of the image encoder and a high-level layer of the fusion network. Extensive experiments on three large-scale benchmark datasets show that our method bridges the modality gap and achieves state-of-the-art retrieval performance.
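
To make the mutual-information objective concrete, the following is a minimal sketch (not the authors' implementation) of a Deep InfoMax-style Jensen-Shannon loss that ties a fused (image + text) feature to the target image's feature; the same construction could also be applied between text and image representations. The discriminator architecture and the names `fusion_feat` and `target_feat` are illustrative assumptions.

```python
# Minimal sketch of a Deep InfoMax-style (JSD) mutual-information objective.
# Assumption: features are plain vectors; the discriminator is a small MLP.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MIDiscriminator(nn.Module):
    """Scores how likely a (fusion, target) feature pair is a true match."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([a, b], dim=-1)).squeeze(-1)


def jsd_mi_loss(disc: MIDiscriminator,
                fusion_feat: torch.Tensor,
                target_feat: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon lower bound on mutual information, as in Deep InfoMax.

    Positive pairs: matching (fusion, target) rows of the batch.
    Negative pairs: fusion features paired with shuffled target features.
    """
    # Positive term: -softplus(-T(x, y)) for matched pairs.
    pos = -F.softplus(-disc(fusion_feat, target_feat)).mean()
    # Negative term: softplus(T(x, y')) for mismatched pairs.
    shuffled = target_feat[torch.randperm(target_feat.size(0))]
    neg = F.softplus(disc(fusion_feat, shuffled)).mean()
    # Maximizing the bound corresponds to minimizing this loss.
    return neg - pos


if __name__ == "__main__":
    dim, batch = 256, 8
    disc = MIDiscriminator(dim)
    fusion_feat = torch.randn(batch, dim)   # output of the fusion network (illustrative)
    target_feat = torch.randn(batch, dim)   # encoder feature of the desired image (illustrative)
    print(jsd_mi_loss(disc, fusion_feat, target_feat))
```

In such a setup, minimizing this loss jointly with the retrieval objective would encourage the fused feature and the desired image's feature to share a common subspace, which is the role the abstract attributes to DIM.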