Going Beyond Nouns With Vision & Language Models Using Synthetic Data

Paola Cascante-Bonilla, Khaled Shehada, James Seale Smith, Sivan Doveh, Donghyun Kim, Rameswar Panda, Gül Varol, Aude Oliva, Vicente Ordonez, Rogerio Feris, Leonid Karlinsky

From arXiv. Project page: https://synthetic-vic.github.io/

Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, making it possible to replace a fixed set of supported classes with zero-shot open-vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models: for example, they have difficulty understanding Visual Language Concepts (VLC) that go 'beyond nouns', such as the meaning of non-object words (e.g., attributes, actions, relations, states), and difficulty performing compositional reasoning, such as understanding the significance of word order in a sentence. In this work, we investigate to what extent purely synthetic data can be leveraged to teach these models to overcome such shortcomings without compromising their zero-shot capabilities. We contribute Synthetic Visual Concepts (SyViC), a million-scale synthetic dataset and data generation codebase that allows generating additional suitable data to improve the VLC understanding and compositional reasoning of VL models. Additionally, we propose a general VL finetuning strategy for effectively leveraging SyViC towards achieving these improvements. Our extensive experiments and ablations on the VL-Checklist, Winoground, and ARO benchmarks demonstrate that it is possible to adapt strong pre-trained VL models with synthetic data, significantly enhancing their VLC understanding (e.g., by 9.9% on ARO and 4.3% on VL-Checklist) with under a 1% drop in their zero-shot accuracy.

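Summary: Large pre-trained VL models support zero-shot, open-vocabulary reasoning from natural language prompts instead of a fixed class set, but recent studies show they struggle with visual language concepts beyond nouns (attributes, actions, relations, states) and with compositional reasoning such as sensitivity to word order. The paper asks how far purely synthetic data can go in fixing these weaknesses without harming zero-shot ability. It contributes SyViC, a million-scale synthetic dataset with a data generation codebase, and a general VL finetuning strategy that uses it. Experiments and ablations on VL-Checklist, Winoground, and ARO show significantly better VLC understanding (e.g., 9.9% on ARO and 4.3% on VL-Checklist) at a cost of under 1% in zero-shot accuracy.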
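To make the word-order weakness concrete, here is a minimal Python sketch of the kind of check that compositional benchmarks perform: a model is credited only when it scores the true caption strictly higher than a reordered one. This is not the code of ARO, Winoground, or VL-Checklist. The text `reference` is an assumed stand-in for the image, and both scorers (`bag_of_words_score`, `bigram_score`) and the example captions are hypothetical toys chosen for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# (reference, true caption, reordered caption). The reference string is an
# ASSUMED stand-in for an image; a real VL model would embed pixels instead.
Triple = Tuple[str, str, str]


def bag_of_words_score(reference: str, caption: str) -> float:
    """Hypothetical order-blind scorer: counts shared words, ignoring order."""
    ref = Counter(reference.lower().split())
    cap = Counter(caption.lower().split())
    return float(sum((ref & cap).values()))


def _bigrams(text: str) -> Counter:
    """Multiset of adjacent word pairs in the text."""
    words: List[str] = text.lower().split()
    return Counter(zip(words, words[1:]))


def bigram_score(reference: str, caption: str) -> float:
    """Hypothetical order-aware scorer: counts shared adjacent word pairs."""
    return float(sum((_bigrams(reference) & _bigrams(caption)).values()))


def order_sensitivity_accuracy(
    triples: Sequence[Triple],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of triples where the true caption strictly outscores the reordered one."""
    wins = sum(
        1
        for reference, true_caption, reordered_caption in triples
        if score(reference, true_caption) > score(reference, reordered_caption)
    )
    return wins / len(triples)


# Illustrative examples only; not taken from any benchmark.
TRIPLES: List[Triple] = [
    (
        "the horse is eating the grass",
        "the horse is eating the grass",
        "the grass is eating the horse",
    ),
    (
        "a man rides a bike",
        "a man rides a bike",
        "a bike rides a man",
    ),
]

if __name__ == "__main__":
    print("bag of words:", order_sensitivity_accuracy(TRIPLES, bag_of_words_score))
    print("bigrams:", order_sensitivity_accuracy(TRIPLES, bigram_score))
```

The order-blind scorer ties on every pair because both captions contain the same words, so it never wins under the strict comparison, while the order-aware scorer separates them. A VL model that behaves like a bag of words would fail such a test in the same way, which is the failure mode the abstract describes.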