Cross-modal retrieval has attracted considerable attention in both the computer vision and natural language processing communities. With the development of convolutional and recurrent neural networks, the bottleneck in image-text retrieval is no longer the extraction of image and text features but the learning of an effective loss function over the joint embedding space. Many loss functions aim to pull paired features from heterogeneous modalities closer together. This paper proposes a method for learning a joint embedding of images and texts with an intra-modal constraint loss that reduces violations by negative pairs drawn from the same modality. Experimental results show that our approach outperforms state-of-the-art bidirectional image-text retrieval methods on the Flickr30K and Microsoft COCO datasets. Our code is publicly available: https://github.com/CanonChen/IMC.
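To make the idea concrete, the following is a minimal sketch of a hinge-based ranking loss augmented with an intra-modal term, written in PyTorch. The class name, margin value, and weighting are illustrative assumptions and not necessarily the formulation released in the linked repository; it only sketches the general technique of penalizing same-modality negatives alongside the usual cross-modal triplet terms.

```python
import torch
import torch.nn as nn


class IntraModalConstraintLoss(nn.Module):
    """Hinge-based ranking loss with an added intra-modal term.

    A minimal sketch (not the authors' exact formulation): the usual
    bidirectional cross-modal hinge terms are combined with penalties
    that push non-matching samples of the *same* modality apart in the
    joint embedding space.
    """

    def __init__(self, margin=0.2, intra_weight=1.0):
        super().__init__()
        self.margin = margin          # hinge margin (illustrative value)
        self.intra_weight = intra_weight  # weight of the intra-modal term

    def forward(self, im, txt):
        # im, txt: (batch, dim) L2-normalized embeddings of matched image-text pairs
        scores = im @ txt.t()                 # cross-modal similarities
        pos = scores.diag().view(-1, 1)       # matched-pair scores

        # standard bidirectional cross-modal hinge terms
        cost_im = (self.margin + scores - pos).clamp(min=0)       # text as query
        cost_txt = (self.margin + scores - pos.t()).clamp(min=0)  # image as query

        # intra-modal constraint: non-matching items of the same modality
        # should not score higher than the matched cross-modal pair
        sim_ii = im @ im.t()
        sim_tt = txt @ txt.t()
        cost_ii = (self.margin + sim_ii - pos).clamp(min=0)
        cost_tt = (self.margin + sim_tt - pos).clamp(min=0)

        # mask out the diagonal (positive / self-similarity entries)
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        cost_im = cost_im.masked_fill(mask, 0)
        cost_txt = cost_txt.masked_fill(mask, 0)
        cost_ii = cost_ii.masked_fill(mask, 0)
        cost_tt = cost_tt.masked_fill(mask, 0)

        cross = cost_im.sum() + cost_txt.sum()
        intra = cost_ii.sum() + cost_tt.sum()
        return cross + self.intra_weight * intra
```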