Cross-modal retrieval methods are the preferred tool to search databases for the text that best matches a query image and vice versa. However, image-text retrieval models commonly learn to memorize spurious correlations in the training data, such as frequent object co-occurrence, instead of grounding their predictions in the actual content of the image. For image-text retrieval, this manifests in retrieved sentences that mention objects not present in the query image. In this work, we introduce ODmAP@k, an object decorrelation metric that measures a model's robustness to spurious correlations in the training data. We use automatic image and text manipulations to control the presence of such object correlations in designated test data. Additionally, our data synthesis technique is used to tackle model biases due to spurious correlations of semantically unrelated objects in the training data. We apply our proposed pipeline, which involves finetuning image-text retrieval frameworks on carefully designed synthetic data, to three state-of-the-art models for image-text retrieval. This results in significant improvements for all three models, both in terms of standard retrieval performance and in terms of our object decorrelation metric. The code is available at https://github.com/ExplainableML/Spurious_CM_Retrieval.
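The abstract does not spell out how ODmAP@k is computed, but it is named after the standard mAP@k retrieval metric. The sketch below illustrates a plain mAP@k evaluation of the kind such a metric builds on; the query set, the relevance oracle, and the manipulated test data that ODmAP@k actually uses are assumptions here, not the paper's definition.

```python
# Minimal sketch of an mAP@k-style retrieval evaluation.
# NOTE: this is NOT the ODmAP@k definition from the paper; `relevant` below is
# a hypothetical relevance oracle standing in for the manipulated test data.
from typing import Sequence, Set, List


def average_precision_at_k(ranked_ids: Sequence[int], relevant: Set[int], k: int) -> float:
    """Average precision over the top-k retrieved items for a single query."""
    hits = 0
    precision_sum = 0.0
    for rank, item_id in enumerate(ranked_ids[:k], start=1):
        if item_id in relevant:
            hits += 1
            precision_sum += hits / rank
    denom = min(len(relevant), k)
    return precision_sum / denom if denom > 0 else 0.0


def mean_average_precision_at_k(
    all_rankings: List[Sequence[int]], all_relevant: List[Set[int]], k: int = 10
) -> float:
    """Mean AP@k over all queries (e.g., query images with controlled object content)."""
    scores = [
        average_precision_at_k(ranking, rel, k)
        for ranking, rel in zip(all_rankings, all_relevant)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy usage: two queries, each with a set of ground-truth relevant caption ids.
rankings = [[3, 1, 7, 2], [5, 9, 0, 4]]
relevant_sets = [{1, 2}, {9}]
print(mean_average_precision_at_k(rankings, relevant_sets, k=3))  # 0.375
```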