Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational information. In this paper, we present Visual Spatial Reasoning (VSR), a dataset containing more than 10k natural text-image pairs with 65 types of spatial relations in English (such as under, in front of, and facing). While using a seemingly simple annotation format, we show how the dataset includes challenging linguistic phenomena, such as varying reference frames. We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models only achieve around 70%. We observe that VLMs' by-relation performance has little correlation with the number of training examples, and that the tested models are generally incapable of recognising relations concerning the orientations of objects.