Spatial understanding is a fundamental aspect of computer vision and integral to human-level reasoning about images, making it an important component of grounded language understanding. While recent large-scale text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of T2I models to generate correct spatial relationships among objects and present VISOR, an evaluation metric that captures how accurately the spatial relationship described in the text is rendered in the generated image. To benchmark existing models, we introduce SR2D, a large-scale challenge dataset of sentences that each describe two objects and the spatial relationship between them. We construct an automated evaluation pipeline that uses computer vision to recognize objects and their spatial relationships, and we apply it in a large-scale evaluation of T2I models. Our experiments reveal a surprising finding: although recent state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations such as left/right/above/below. Our analyses reveal several biases and artifacts of T2I models, including difficulty generating multiple objects, a bias towards generating the first object mentioned, spatially inconsistent outputs for equivalent relationships, and a correlation between object co-occurrence and spatial understanding capabilities. We conduct a human study that shows that VISOR aligns with human judgment about spatial understanding. We offer the SR2D dataset and the VISOR metric to the community in support of T2I spatial reasoning research.
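To make the idea of the automated pipeline concrete, the following is a minimal sketch of how a detected pair of objects might be checked against a prompted relation. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify how relations are decided or how VISOR is computed, so the centroid comparison, the `Box` type, and the helpers `relation_holds`, `score_image`, and `spatial_accuracy` are all hypothetical. Detections are assumed to come from some off-the-shelf object detector, with image coordinates whose y axis grows downward.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical detector output: a label and an axis-aligned box in pixels."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def centroid(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def relation_holds(a, b, relation):
    """Assumed rule: compare centroids; y grows downward in image coordinates."""
    ax, ay = a.centroid
    bx, by = b.centroid
    if relation == "left of":
        return ax < bx
    if relation == "right of":
        return ax > bx
    if relation == "above":
        return ay < by
    if relation == "below":
        return ay > by
    raise ValueError(f"unsupported relation: {relation}")


def score_image(detections, obj_a, obj_b, relation):
    """Return (both_objects_present, relation_correct) for one generated image."""
    by_label = {d.label: d for d in detections}
    if obj_a not in by_label or obj_b not in by_label:
        return False, False
    return True, relation_holds(by_label[obj_a], by_label[obj_b], relation)


def spatial_accuracy(results):
    """Fraction of images with both objects present AND the relation correct.

    This is an illustrative aggregate, not the published VISOR formula.
    """
    if not results:
        return 0.0
    return sum(1 for present, correct in results if present and correct) / len(results)


# Example: prompt "a dog to the left of a cat", two detections in one image.
detections = [Box("dog", 10, 40, 50, 90), Box("cat", 120, 45, 170, 95)]
example = score_image(detections, "dog", "cat", "left of")
```

Separating "both objects present" from "relation correct" mirrors the two failure modes the abstract distinguishes: failing to generate multiple objects at all, and generating them in the wrong arrangement.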