Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode compositional information. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO & Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We show where state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on large datasets with rich compositional structure in the images and captions. Yet, training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on these datasets has failed to surface this deficiency. To understand why these limitations emerge and are not represented in the standard tests, we zoom into the evaluation and training procedures. We demonstrate that it is possible to perform well on retrieval over existing datasets without using the composition and order information. Given that contrastive pretraining optimizes for retrieval on datasets with similar shortcuts, we hypothesize that this can explain why the models do not need to learn to represent compositional information. This finding suggests a natural solution: composition-aware hard negative mining. We show that a simple-to-implement modification of contrastive learning significantly improves the performance on tasks requiring understanding of order and compositionality.
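To make the order tests concrete, the following is a minimal sketch of how an order-sensitivity evaluation in the style of COCO-Order and Flickr30k-Order can be set up. The perturbation scheme (plain word shuffles), the `model.score` similarity function, and all names below are illustrative assumptions, not the benchmark's actual implementation.

```python
import random

def make_order_negatives(caption: str, n: int = 4, seed: int = 0) -> list[str]:
    """Create hard alternatives by permuting the words of a caption.
    (ARO's Order tests use a fixed set of perturbation types, e.g.
    shuffling only nouns and adjectives; plain shuffles suffice here.)"""
    rng = random.Random(seed)
    words = caption.split()
    if len(set(words)) < 2:
        return []  # no distinct reordering exists
    negatives: list[str] = []
    while len(negatives) < n:
        shuffled = words[:]
        rng.shuffle(shuffled)
        if shuffled != words:
            negatives.append(" ".join(shuffled))
    return negatives

def order_sensitivity_accuracy(model, images, captions) -> float:
    """Fraction of examples where the true caption outscores every
    order-perturbed alternative. `model.score(image, text)` stands in
    for any image-text similarity, e.g. a CLIP cosine similarity."""
    correct = 0
    for image, caption in zip(images, captions):
        candidates = [caption] + make_order_negatives(caption)
        scores = [model.score(image, c) for c in candidates]
        correct += int(scores.index(max(scores)) == 0)
    return correct / len(images)
```

A model with genuine order sensitivity should rank the intact caption above its shuffled variants; the abstract's finding is that state-of-the-art VLMs frequently do not.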
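And here is a minimal sketch of the kind of composition-aware hard negative mining the abstract proposes, written against PyTorch. The function name, the single-direction (image-to-text) loss, and the use of word-shuffled captions as negatives are simplifying assumptions; the core idea is only that each image is additionally contrasted against a perturbed version of its own caption, so the model can no longer match image and text while ignoring composition and order.

```python
import torch
import torch.nn.functional as F

def clip_loss_with_hard_negatives(image_emb: torch.Tensor,
                                  text_emb: torch.Tensor,
                                  neg_text_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss in which each image is contrasted against its own
    caption, the other captions in the batch, and one composition-aware
    hard negative (e.g. the word-shuffled caption, encoded by the same
    text encoder). All inputs have shape (batch, dim). Only the
    image-to-text direction is shown; CLIP-style training is symmetric."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    neg_text_emb = F.normalize(neg_text_emb, dim=-1)

    # (B, B): similarity of every image to every in-batch caption.
    logits = image_emb @ text_emb.t() / temperature
    # (B, 1): similarity of each image to its own hard-negative caption.
    neg_logits = (image_emb * neg_text_emb).sum(dim=-1, keepdim=True) / temperature

    # Append the hard negatives as extra "classes"; the target stays the diagonal.
    logits = torch.cat([logits, neg_logits], dim=1)
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    return F.cross_entropy(logits, targets)
```

Because the hard negative shares the exact bag of words with the true caption, the only signal separating the two is compositional structure, which is why this small modification targets the deficiency the benchmark exposes.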