Image-text retrieval, as a fundamental and important branch of information retrieval, has attracted extensive research attention. The main challenge of this task is cross-modal semantic understanding and matching. Some recent works focus more on fine-grained cross-modal semantic matching. With the prevalence of large-scale multimodal pretraining models, several state-of-the-art models (e.g., X-VLM) have achieved near-perfect performance on the widely used image-text retrieval benchmarks, i.e., MSCOCO-Test-5K and Flickr30K-Test-1K. In this paper, we review these two common benchmarks and observe that they are insufficient to assess the true capability of models in fine-grained cross-modal semantic matching, because a large number of images and texts in the benchmarks are coarse-grained. Based on this observation, we renovate the coarse-grained images and texts in the old benchmarks and establish improved benchmarks called MSCOCO-FG and Flickr30K-FG. Specifically, on the image side, we enlarge the original image pool by adding more similar images. On the text side, we propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort. Furthermore, we evaluate representative image-text retrieval models on our new benchmarks to demonstrate the effectiveness of our method. We also analyze the capability of models in fine-grained semantic comprehension through extensive experiments. The results show that even the state-of-the-art models still have much room for improvement in fine-grained semantic understanding, especially in distinguishing attributes of close objects in images. Our code and improved benchmark datasets are publicly available at https://github.com/cwj1412/MSCOCO-Flikcr30K_FG, which we hope will inspire further in-depth research on cross-modal retrieval.
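As a minimal illustration of how models are typically scored on MSCOCO/Flickr30K-style retrieval splits (and hence on the renovated MSCOCO-FG and Flickr30K-FG), the sketch below computes Recall@K in both retrieval directions from a model-produced similarity matrix. This is not code from the paper: the function name `recall_at_k`, the `sims` matrix, and the assumption that each image has 5 ground-truth captions stored contiguously are illustrative conventions, not the authors' implementation.

```python
# Hedged sketch: Recall@K evaluation for image-text retrieval, assuming a
# (num_images x num_texts) similarity matrix where the captions of image i
# occupy columns 5*i .. 5*i+4 (the usual MSCOCO/Flickr30K layout).
import numpy as np

def recall_at_k(sims: np.ndarray, ks=(1, 5, 10)):
    num_images, num_texts = sims.shape
    caps_per_image = num_texts // num_images  # 5 for MSCOCO/Flickr30K splits

    # Image-to-text retrieval: for each image query, rank all captions and
    # record the rank of its best-ranked ground-truth caption.
    i2t_ranks = []
    for i in range(num_images):
        order = np.argsort(-sims[i])  # caption indices, best first
        gt = np.arange(caps_per_image * i, caps_per_image * (i + 1))
        i2t_ranks.append(int(np.where(np.isin(order, gt))[0].min()))

    # Text-to-image retrieval: for each caption query, rank all images and
    # record the rank of its ground-truth image.
    t2i_ranks = []
    for j in range(num_texts):
        order = np.argsort(-sims[:, j])  # image indices, best first
        t2i_ranks.append(int(np.where(order == j // caps_per_image)[0][0]))

    i2t_ranks, t2i_ranks = np.array(i2t_ranks), np.array(t2i_ranks)
    i2t = {f"i2t_R@{k}": float(100 * np.mean(i2t_ranks < k)) for k in ks}
    t2i = {f"t2i_R@{k}": float(100 * np.mean(t2i_ranks < k)) for k in ks}
    return i2t, t2i

if __name__ == "__main__":
    # Random similarities for a toy 1K-image / 5K-caption split, just to show usage.
    rng = np.random.default_rng(0)
    print(recall_at_k(rng.standard_normal((1000, 5000))))
```

Under this protocol, "near-perfect performance" in the abstract means Recall@1/5/10 approaching 100 on the original test splits; the renovated FG benchmarks are designed so that such scores again leave visible headroom for fine-grained matching.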