Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence (SI). We thus propose EASI for holistic Evaluation of multimodAl LLMs on Spatial Intelligence. EASI conceptualizes a comprehensive taxonomy of spatial tasks that unifies existing benchmarks with a growing collection of newly curated ones, enabling systematic evaluation of state-of-the-art models. In this report, we conduct the study across eight key benchmarks, at a cost exceeding ten billion tokens in total. Our empirical study reveals that (1) GPT-5 demonstrates unprecedented strength in SI, yet (2) still falls significantly short of human performance across a broad spectrum of SI tasks. Moreover, we (3) show that SI tasks expose greater deficiencies in model capability than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage on the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans yet defeat even the most advanced multimodal models. EASI is an ongoing community effort: we have open-sourced the EASI codebase, which provides a one-stop, reproducible solution with standardized interfaces, integrated protocols, and prompts that significantly reduce the friction of configuring and running multiple benchmarks; we have also launched an accompanying EASI leaderboard to provide a continually updated snapshot of model performance across the full SI spectrum, accelerating collective progress toward robust SI.
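To make the notion of "standardized interfaces with integrated protocols" concrete, the following is a minimal sketch of what a unified benchmark-evaluation loop might look like; it is purely illustrative, and all names (`Sample`, `EasiBenchmark`, `evaluate`) are hypothetical rather than taken from the actual EASI codebase.

```python
# Hypothetical sketch only: these names do NOT come from the real EASI codebase.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    """One evaluation item: an image reference, a question, and a reference answer."""
    image: str
    question: str
    answer: str


@dataclass
class EasiBenchmark:
    """A benchmark is a named list of samples plus a scoring rule."""
    name: str
    samples: List[Sample]
    score: Callable[[str, str], float]  # (prediction, reference) -> score in [0, 1]


def evaluate(model: Callable[[str, str], str],
             benchmarks: List[EasiBenchmark]) -> Dict[str, float]:
    """Run one model callable over several benchmarks under a shared protocol."""
    results: Dict[str, float] = {}
    for bench in benchmarks:
        total = 0.0
        for s in bench.samples:
            prediction = model(s.image, s.question)   # standardized model interface
            total += bench.score(prediction, s.answer)
        results[bench.name] = total / max(len(bench.samples), 1)
    return results
```

Under such an interface, adding a new model or a new spatial benchmark only requires supplying the corresponding callable or sample list, which is the kind of friction reduction the open-sourced codebase aims at.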