Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational information. In this paper, we present Visual Spatial Reasoning (VSR), a dataset of more than 10k natural text-image pairs covering 66 types of spatial relations in English (such as under, in front of, and facing). Despite its seemingly simple annotation format, we show that the dataset includes challenging linguistic phenomena, such as varying reference frames. We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models achieve only around 70%. We observe that VLMs' per-relation performance correlates little with the number of training examples, and that the tested models are generally incapable of recognising relations concerning the orientation of objects.
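To make the annotation format concrete, the sketch below shows what a single VSR-style example might look like as a Python record; the field names (image, caption, relation, label) and the dataclass structure are illustrative assumptions, not the released schema.

```python
# Minimal sketch of a VSR-style annotation record (field names are assumptions
# for illustration; consult the released dataset for the exact schema).

from dataclasses import dataclass


@dataclass
class VSRExample:
    image: str      # path or identifier of the image
    caption: str    # e.g. "The cat is under the table."
    relation: str   # e.g. "under" (one of the 66 spatial relations)
    label: bool     # True if the caption correctly describes the image


example = VSRExample(
    image="000000123456.jpg",
    caption="The cat is under the table.",
    relation="under",
    label=True,
)

# Under this framing, a model's task reduces to binary classification:
# predict `label` given the (image, caption) pair.
print(example)
```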