Recently, large-scale vision-language pre-training models and visual semantic embedding methods have significantly improved image-text matching (ITM) accuracy on the MS COCO 5K test set. However, it remains unclear how robust these state-of-the-art (SOTA) models are when used in the wild. In this paper, we propose a novel evaluation benchmark to stress-test the robustness of ITM models. To this end, we add various fooling images and captions to the retrieval pool. Specifically, we alter images by inserting unrelated images, and alter captions by substituting a noun, which can change the meaning of the sentence. We discover that merely adding these newly created images and captions to the test set degrades the performance (i.e., Recall@1) of a wide range of SOTA models (e.g., 81.9% $\rightarrow$ 64.5% for BLIP, 66.1% $\rightarrow$ 37.5% for VSE$\infty$). We expect our findings to provide insights for improving the robustness of vision-language models and for devising more diverse stress tests for cross-modal retrieval. Source code and dataset will be available at https://github.com/pseulki/rococo.
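To make the caption-perturbation idea concrete, the following Python sketch swaps one noun in a caption with an unrelated one to produce a "fooling" caption. This is only a minimal illustration under assumed choices: the noun pool and the helper `make_fooling_caption` are hypothetical, not the paper's actual pipeline.

```python
# Minimal sketch of creating a fooling caption by substituting a noun,
# so that the sentence meaning changes. The noun pool and function name
# are illustrative assumptions, not the benchmark's real implementation.
import random

# Hypothetical pool of replacement nouns (assumption for illustration).
NOUN_POOL = ["dog", "cat", "car", "pizza", "umbrella", "guitar"]


def make_fooling_caption(caption: str) -> str:
    """Replace the first known noun in the caption with a different noun."""
    words = caption.split()
    for i, word in enumerate(words):
        stripped = word.lower().strip(".,")
        if stripped in NOUN_POOL:
            candidates = [n for n in NOUN_POOL if n != stripped]
            words[i] = random.choice(candidates)
            return " ".join(words)
    return caption  # no known noun found; caption left unchanged


if __name__ == "__main__":
    print(make_fooling_caption("A dog is riding a skateboard."))
    # e.g. "A pizza is riding a skateboard."
```

Such perturbed captions would then be added to the retrieval pool alongside the originals when measuring Recall@1.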