A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in a prompt-to-full-video manner, lacking the causal control, interactivity, and long-horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D-scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long-horizon world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture, which combines an autoregressive latent dynamics backbone based on a large language model (LLM) with a video diffusion decoder. The LLM backbone grounds simulation in extensive text-based knowledge and enables conditioning on language-specified actions, while the diffusion decoder reconstructs perceptually detailed and temporally coherent visual observations, unifying latent-space reasoning (imagination) with realizable world dynamics (reality). Trained on large-scale video-action pairs spanning diverse domains, PAN supports open-domain, action-conditioned simulation with coherent, long-term dynamics. Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.
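To make the GLP loop concrete, the following is a minimal conceptual sketch of the predict-then-decode cycle the abstract describes: an autoregressive backbone advances a latent world state conditioned on a language-specified action, and a decoder renders each latent state back into an observation. Everything here is a hypothetical stand-in chosen for brevity (a GRU cell in place of the LLM backbone, an MLP in place of the video diffusion decoder, and invented names such as `LatentDynamicsBackbone` and `rollout`); it is not PAN's actual implementation.

```python
# A minimal sketch of the Generative Latent Prediction (GLP) loop.
# All module names, shapes, and the GRU/MLP stand-ins are hypothetical
# illustrations of the predict-then-decode pattern, not PAN itself.
import torch
import torch.nn as nn


class LatentDynamicsBackbone(nn.Module):
    """Stand-in for the LLM-based autoregressive latent dynamics model."""

    def __init__(self, latent_dim: int = 256):
        super().__init__()
        # A single GRU cell stands in for the autoregressive LLM backbone.
        self.cell = nn.GRUCell(latent_dim * 2, latent_dim)

    def forward(self, state: torch.Tensor, action_emb: torch.Tensor) -> torch.Tensor:
        # Predict the next latent world state from the current state and
        # an embedded natural-language action.
        return self.cell(torch.cat([state, action_emb], dim=-1), state)


class VideoDiffusionDecoder(nn.Module):
    """Stand-in for the video diffusion decoder (here a plain MLP renderer)."""

    def __init__(self, latent_dim: int = 256, frame_pixels: int = 64 * 64 * 3):
        super().__init__()
        self.render = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, frame_pixels)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Reconstruct a (flattened) observation frame from the latent state.
        return self.render(state)


def rollout(backbone, decoder, init_state, action_embs):
    """Simulate future observations conditioned on a sequence of actions."""
    state, frames = init_state, []
    for action in action_embs:           # one language-specified action per step
        state = backbone(state, action)  # latent-space prediction ("imagination")
        frames.append(decoder(state))    # decode back to pixel space ("reality")
    return torch.stack(frames, dim=1)


if __name__ == "__main__":
    backbone, decoder = LatentDynamicsBackbone(), VideoDiffusionDecoder()
    state = torch.zeros(1, 256)       # encoded observation history (toy)
    actions = torch.randn(8, 1, 256)  # eight embedded action strings (toy)
    video = rollout(backbone, decoder, state, actions)
    print(video.shape)  # torch.Size([1, 8, 12288])
```

The point of the sketch is the division of labor: the backbone rolls the world forward entirely in latent space, so long-horizon consistency is a property of the dynamics model, while the decoder is invoked per state only to realize perceptual detail.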