Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, we observe that these models exhibit poor out-of-distribution (OOD) generalization on the task of VQA. To better understand the underlying causes of poor generalization, we comprehensively investigate the performance of two pretrained V&L models under different settings (i.e., classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task. We also argue that in most cases generative models are less susceptible to shifts in data distribution and frequently perform better on our tested benchmarks. Moreover, we find that multimodal pretraining improves OOD performance in most settings. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.
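To illustrate the stringency alluded to above, the sketch below shows the core of the standard VQA accuracy metric (Antol et al., 2015), which credits a prediction only when it string-matches the answers given by at least a subset of the (typically ten) human annotators; the official implementation additionally normalizes answers (lowercasing, article removal, etc.), which is omitted here for brevity. The example answers are hypothetical.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Core of the standard VQA accuracy: count how many human annotators
    gave exactly the predicted answer string, and cap the score at 1.0
    once at least three annotators agree."""
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)

# Exact string matching is what makes the metric stringent: a semantically
# correct but differently worded answer receives no credit.
print(vqa_accuracy("two", ["two"] * 10))  # 1.0
print(vqa_accuracy("2",   ["two"] * 10))  # 0.0, despite being correct
```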