Image-text retrieval is one of the major tasks in cross-modal retrieval. Many approaches to this task map images and texts into a common space to establish correspondences between the two modalities. However, because images are semantically rich, redundant secondary information in an image can cause false matches. To address this issue, this paper presents a semantic optimization approach, implemented as a Visual Semantic Loss (VSL), that helps the model focus on an image's main content. The approach is inspired by how people typically annotate an image: by describing its main content. We therefore leverage the texts annotated for an image to help the model capture its main content, reducing the negative impact of secondary content. Extensive experiments on two benchmark datasets (MSCOCO and Flickr30K) demonstrate the superior performance of our method. The code is available at: https://github.com/ZhangXu0963/VSL.