Spatial reasoning is crucial for Vision Language Models (VLMs) to support real-world applications in diverse domains, including robotics, augmented reality, and autonomous navigation. Unfortunately, existing benchmarks assess spatial reasoning inadequately, especially \emph{intrinsic-dynamic} spatial reasoning, which is a fundamental aspect of human spatial cognition. In this paper, we propose a unified benchmark, \textbf{Spatial-DISE}, based on a cognitively grounded taxonomy that categorizes tasks into four fundamental quadrants: \textbf{I}ntrinsic-\textbf{S}tatic, Intrinsic-\textbf{D}ynamic, \textbf{E}xtrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover, to address data scarcity, we develop a scalable, automated pipeline that generates diverse and verifiable spatial reasoning questions, resulting in a new \textbf{Spatial-DISE} dataset comprising Spatial-DISE Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA pairs). Our comprehensive evaluation of 28 state-of-the-art VLMs reveals that current VLMs show a large and consistent gap relative to human competence, especially on multi-step, multi-view spatial reasoning. Spatial-DISE offers a robust framework, a valuable dataset, and a clear direction for future research toward human-like spatial intelligence. The benchmark, dataset, and code will be publicly released.