We present a novel framework for probing and improving relational, compositional, and contextual understanding in large visual-language (V+L) models. While large V+L models have achieved success on various downstream tasks, it is not clear whether they have a conceptual grasp of the content they process. We propose a novel benchmarking dataset for probing three aspects of content understanding. Our probes are grounded in cognitive science and help determine, for example, whether a V+L model can recognize that "snow garnished with a man" is implausible, or whether it can identify beach furniture by knowing that it is located on a beach. We experiment with five well-known models, such as CLIP and ViLT, and find that they mostly fail to demonstrate conceptual understanding. That said, we uncover interesting insights, such as that cross-attention helps in learning conceptual understanding. We use these insights to propose a new finetuning technique that rewards the three conceptual understanding measures we propose. We hope that the presented benchmarks will help the community assess and improve the conceptual understanding capabilities of large V+L models.
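As an illustration of the kind of zero-shot probe described above (not the paper's actual protocol), one might compare how strongly an off-the-shelf model such as CLIP matches an image against a plausible versus an implausible caption. The sketch below uses the standard HuggingFace CLIP checkpoint; the image path and the caption pair are hypothetical placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Minimal plausibility-style probe sketch, assuming the public CLIP checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("snow_scene.jpg")  # hypothetical image of a man in the snow
captions = [
    "a man garnished with snow",   # plausible description
    "snow garnished with a man",   # implausible relational swap
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_texts); a model with relational
# understanding should score the plausible caption markedly higher.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```

A model that relies mainly on bag-of-words matching would score both captions similarly, which is the kind of failure such probes are designed to expose.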