Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. Our framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts in which predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating the cumulative errors common in decoupled "envision-then-plan" pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision-action coupling and unified training: removing either branch degrades both prediction quality and policy reliability. In real-world testing, AstraNav-World demonstrates strong zero-shot generalization, adapting to previously unseen scenarios without any real-world fine-tuning. These results suggest that AstraNav-World captures transferable spatial understanding and planning-relevant navigation dynamics rather than merely overfitting to simulation-specific data distributions. Overall, by unifying visual foresight and control within a single generative model, we move closer to reliable, interpretable, and general-purpose embodied agents that operate robustly in open-ended real-world settings.
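To make the two complementary objectives concrete, the joint loss might take roughly the following form; this is a minimal sketch, assuming a standard denoising objective for the diffusion video branch and a log-likelihood term for the policy branch. The notation (observed frames o_{1:k}, planned actions a_{1:H}, predicted frames ô_{k+1:k+H}, denoiser ε_θ, policy π_φ, and weight λ) is illustrative and not taken from the paper:

\[
\mathcal{L} \;=\; \underbrace{\mathbb{E}\big[\lVert \epsilon - \epsilon_\theta(x_\tau, \tau \mid o_{1:k}, a_{1:H}) \rVert^2\big]}_{\text{action-conditioned video prediction}} \;+\; \lambda \, \underbrace{\mathbb{E}\big[-\log \pi_\phi(a_{1:H} \mid o_{1:k}, \hat{o}_{k+1:k+H})\big]}_{\text{trajectories from predicted visuals}}
\]

Under this reading, the bidirectional constraint arises because the first term conditions generated futures on planned actions while the second conditions planned actions on generated futures, so neither branch can drift from the other during unified training.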