Multimodal image-text models have shown remarkable performance in the past few years. However, evaluating their robustness against distribution shifts is crucial before adopting them in real-world applications. In this paper, we investigate the robustness of nine popular open-source image-text models under common perturbations on five tasks: image-text retrieval, visual reasoning, visual entailment, image captioning, and text-to-image generation. In particular, we propose several new multimodal robustness benchmarks by applying 17 image perturbation and 16 text perturbation techniques on top of existing datasets. We observe that multimodal models are not robust to image and text perturbations, and especially not to image perturbations. Among the tested perturbation methods, character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data. We also introduce two new robustness metrics (MMI and MOR) for proper evaluation of multimodal models. We hope our extensive study sheds light on new directions for the development of robust multimodal models.