Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for "thinking with images" (e.g., ChatGPT-o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present EagleVision, a dual-stage framework for progressive spatial cognition through macro perception and micro verification. In the macro perception stage, EagleVision employs a semantics-perspective-fusion determinantal point process (SPF-DPP) to select a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, we formalize spatial CoT as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views. On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision-language models, demonstrating strong and generalizable spatial understanding.
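To make the macro perception stage concrete, below is a minimal sketch of determinantal-point-process keyframe selection with a fused semantics/perspective kernel. This is a hypothetical illustration, not the paper's implementation: the function name `spf_dpp_select`, the mixing weight `alpha`, and the cosine-similarity kernel construction are all assumptions; the actual SPF-DPP kernel and solver may differ.

```python
import numpy as np

def spf_dpp_select(sem_feats, view_feats, budget, alpha=0.5):
    """Greedy MAP selection under a DPP whose kernel fuses semantic and
    viewpoint (perspective) similarity.

    Hypothetical sketch of the SPF-DPP idea: frames whose fused features
    are mutually diverse yield a larger determinant, so greedily
    maximizing log-det picks a compact, diverse keyframe set under a
    fixed budget (the token-budget constraint maps to `budget` frames).
    """
    # Normalize so dot products are cosine similarities.
    s = sem_feats / np.linalg.norm(sem_feats, axis=1, keepdims=True)
    v = view_feats / np.linalg.norm(view_feats, axis=1, keepdims=True)
    # Fused PSD kernel: convex combination of semantic and perspective
    # similarity, plus a small jitter for numerical stability.
    L = alpha * (s @ s.T) + (1 - alpha) * (v @ v.T)
    L = L + 1e-6 * np.eye(len(L))
    selected, remaining = [], list(range(len(L)))
    while len(selected) < budget and remaining:
        best, best_gain = None, -np.inf
        for j in remaining:
            idx = selected + [j]
            # Log-determinant of the candidate submatrix: higher means
            # the set {selected + j} is more diverse under the kernel.
            gain = np.linalg.slogdet(L[np.ix_(idx, idx)])[1]
            if gain > best_gain:
                best, best_gain = j, gain
        selected.append(best)
        remaining.remove(best)
    return selected
```

Greedy MAP inference is the standard tractable substitute for exact DPP MAP (which is NP-hard); for long videos a faster incremental-Cholesky variant would replace the repeated `slogdet` calls, but the selection objective is the same.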