Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, when evaluated under out-of-distribution (OOD, i.e. out-of-dataset) settings for VQA, we observe that these models exhibit poor generalization. We comprehensively evaluate two pretrained V&L models under different settings (i.e., classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task. We also find that, in most cases, generative models are less susceptible to shifts in data distribution than discriminative ones, and that multimodal pretraining is generally helpful for OOD generalization. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.
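To make the point about stringent metrics concrete, the sketch below illustrates the standard VQA accuracy metric (min of one third of the number of matching annotator answers and 1), which relies on exact string matching against the ten human answers per question. The simplified normalization (lower-casing and whitespace stripping only) is an assumption for brevity; the official evaluator also handles punctuation, articles, and number words, and averages over annotator subsets.

```python
# Minimal sketch of the standard VQA accuracy metric, shown to illustrate why
# exact-match scoring is stringent: a semantically correct paraphrase that does
# not string-match the annotator answers receives little or no credit.

def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Score one prediction against the human answers of a VQA question.

    Acc = min(#annotators who gave exactly this answer / 3, 1).
    Normalization here is simplified (lower-case + strip); the official
    evaluator applies fuller preprocessing and subset averaging.
    """
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)


if __name__ == "__main__":
    answers = ["2", "2", "2", "two", "2", "2", "2", "2", "2", "2"]
    print(vqa_accuracy("2", answers))           # 1.0
    print(vqa_accuracy("two", answers))         # ~0.33, penalized despite being correct
    print(vqa_accuracy("two people", answers))  # 0.0, correct meaning but zero credit
```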