Aerial World Model for Long-horizon Visual Generation and Navigation in 3D Space

Weichen Zhang, Peizhi Tang, Xin Zeng, Fanhang Man, Shiquan Yu, Zichao Dai, Baining Zhao, Hongjin Chen, Yu Shang, Wei Wu, Chen Gao, Xinlei Chen, Xin Wang, Yong Li, Wenwu Zhu

Unmanned aerial vehicles (UAVs) have emerged as powerful embodied agents, and one of their core abilities is autonomous navigation in large-scale three-dimensional environments. Existing navigation policies, however, are typically optimized for low-level objectives such as obstacle avoidance and trajectory smoothness, and they lack the ability to incorporate high-level semantics into planning. To bridge this gap, we propose ANWM, an aerial navigation world model that predicts future visual observations conditioned on past frames and actions, enabling agents to rank candidate trajectories by their semantic plausibility and navigational utility. ANWM is trained on 4-DoF UAV trajectories and introduces a physics-inspired module, Future Frame Projection (FFP), which projects past frames into future viewpoints to provide coarse geometric priors. This module mitigates representational uncertainty in long-distance visual generation and captures the mapping between 3D trajectories and egocentric observations. Empirical results demonstrate that ANWM significantly outperforms existing world models in long-distance visual forecasting and improves UAV navigation success rates in large-scale environments.
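
To make the idea of projecting a past frame into a future viewpoint concrete, the following is a minimal geometric sketch, not the FFP module itself, whose implementation the abstract does not specify. It assumes a pinhole camera looking along the UAV's forward axis, a known per-pixel depth, and a 4-DoF pose of position plus yaw. The names `Pose4DoF`, `Intrinsics`, and `reproject_pixel` are hypothetical and introduced only for illustration.

```python
import math
from typing import NamedTuple, Optional, Tuple


class Pose4DoF(NamedTuple):
    """Assumed 4-DoF UAV pose: world position (z up) and yaw about z, in radians."""
    x: float
    y: float
    z: float
    yaw: float


class Intrinsics(NamedTuple):
    """Assumed pinhole intrinsics, in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float


def reproject_pixel(
    u: float,
    v: float,
    depth: float,
    K: Intrinsics,
    past: Pose4DoF,
    future: Pose4DoF,
) -> Optional[Tuple[float, float]]:
    """Map a pixel seen from the past pose to its location in the future view.

    Returns None when the 3D point is not in front of the future camera.
    """
    # 1. Back-project the pixel into the past camera frame
    #    (X right, Y down, Z forward).
    xc = (u - K.cx) * depth / K.fx
    yc = (v - K.cy) * depth / K.fy
    zc = depth

    # 2. Camera frame -> body frame (x forward, y left, z up).
    xb, yb, zb = zc, -xc, -yc

    # 3. Body frame -> world frame: rotate by the past yaw, then translate.
    c, s = math.cos(past.yaw), math.sin(past.yaw)
    xw = past.x + c * xb - s * yb
    yw = past.y + s * xb + c * yb
    zw = past.z + zb

    # 4. World frame -> future body frame: translate, then rotate by -yaw.
    dx, dy, dz = xw - future.x, yw - future.y, zw - future.z
    c, s = math.cos(future.yaw), math.sin(future.yaw)
    xb2 = c * dx + s * dy
    yb2 = -s * dx + c * dy
    zb2 = dz

    # 5. Future body frame -> future camera frame.
    xc2, yc2, zc2 = -yb2, -zb2, xb2
    if zc2 <= 0.0:
        return None  # the point lies at or behind the future image plane

    # 6. Pinhole projection into the future image.
    return (K.fx * xc2 / zc2 + K.cx, K.fy * yc2 / zc2 + K.cy)
```

Applying `reproject_pixel` to every pixel of a past frame would yield a coarse, hole-filled warp of that frame into the future viewpoint, which is the kind of geometric prior the abstract describes. For example, with `fx = 500` and `cx = 320`, a pixel at `u = 420` with depth 10 moves to `u = 520` after the UAV flies 5 units straight ahead, since the point is now twice as close to the camera.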
