Map environments provide a fundamental medium for representing spatial structure. Understanding how foundation model (FM) agents interpret and act in such environments is therefore critical for enabling reliable map-based reasoning and applications. However, most existing evaluations of spatial ability in FMs rely on static map inputs or text-based queries, overlooking the interactive and experience-driven nature of spatial understanding. In this paper, we propose an interactive evaluation framework to analyze how FM agents explore, remember, and reason in symbolic map environments. Agents incrementally explore partially observable grid-based maps consisting of roads, intersections, and points of interest (POIs), receiving only local observations at each step. Spatial understanding is then evaluated on six kinds of spatial tasks. By systematically varying exploration strategies, memory representations, and reasoning schemes across multiple foundation models, we reveal distinct functional roles of these components. Exploration primarily affects experience acquisition but has limited impact on final reasoning accuracy. In contrast, memory representation plays a central role in consolidating spatial experience: structured memories, particularly sequential and graph-based representations, substantially improve performance on structure-intensive tasks such as path planning. Reasoning schemes further shape how stored spatial knowledge is used, with advanced prompts supporting more effective multi-step inference. We further observe that spatial reasoning performance saturates across model versions and scales beyond a certain capability threshold, indicating that improvements in map-based spatial understanding require mechanisms tailored to spatial representation and reasoning rather than scaling alone.
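To make the setup concrete, the following is a minimal illustrative sketch of the explore, remember, reason loop described above. It is not the framework's implementation. The toy grid, the cell symbols, the function names (`observe`, `explore`, `plan_path`), and the dictionary-based graph memory are all assumptions introduced for illustration, and a scripted depth-first walk stands in for an FM agent's exploration strategy.

```python
from collections import deque

# Hypothetical toy map (assumption, not from the paper):
# '#' = not traversable, '.' = road, '+' = intersection, letters = POIs.
GRID = [
    "A.+.B",
    "#.#.#",
    "#.+.C",
]

def observe(grid, pos):
    """Local observation: the traversable 4-neighbours of pos and their symbols."""
    r, c = pos
    seen = {}
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] != "#":
            seen[(nr, nc)] = grid[nr][nc]
    return seen

def explore(grid, start):
    """Scripted depth-first exploration that consolidates experience into a graph memory."""
    memory = {"edges": {}, "labels": {start: grid[start[0]][start[1]]}}
    stack, visited = [start], set()
    while stack:
        pos = stack.pop()
        if pos in visited:
            continue
        visited.add(pos)
        neighbours = observe(grid, pos)          # only local information is used
        memory["edges"][pos] = sorted(neighbours)
        memory["labels"].update(neighbours)
        stack.extend(n for n in neighbours if n not in visited)
    return memory

def plan_path(memory, src_label, dst_label):
    """Breadth-first search over the stored graph between two POI labels."""
    by_label = {lab: cell for cell, lab in memory["labels"].items() if lab.isalpha()}
    src, dst = by_label[src_label], by_label[dst_label]
    parent = {src: None}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            path = []
            while cur is not None:               # walk parents back to the source
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        for nxt in memory["edges"].get(cur, []):
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None                                  # destination unreachable in memory

memory = explore(GRID, (0, 0))
path = plan_path(memory, "A", "C")
```

In this sketch the path-planning query is answered purely from the graph built during exploration, which is one way to picture why a structured memory could help on structure-intensive tasks. In the evaluated setting, the reasoning step would be carried out by the FM over its memory representation rather than by a hand-written search.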