Recent advances in vision-and-language modeling have seen the development of Transformer architectures that achieve remarkable performance on multimodal reasoning tasks. Yet, the exact capabilities of these black-box models are still poorly understood. While much of the previous work has focused on studying their ability to learn meaning at the word level, their ability to track syntactic dependencies between words has received less attention. We take a first step towards closing this gap by creating a new multimodal task targeted at evaluating understanding of predicate-noun dependencies in a controlled setup. We evaluate a range of state-of-the-art models and find that their performance on the task varies considerably, with some models performing relatively well and others at chance level. In an effort to explain this variability, we analyze the models' pretraining setups and find that the quality (and not only the sheer quantity) of pretraining data is essential. Additionally, the best-performing models leverage fine-grained multimodal pretraining objectives in addition to the standard image-text matching objectives. This study highlights that targeted and controlled evaluations are a crucial step towards a precise and rigorous test of the multimodal knowledge of vision-and-language models.