While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, shareable motion information. Motus introduces a Mixture-of-Transformers (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching among modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages optical flow to learn latent actions and adopts a recipe with a three-phase training pipeline and a six-layer data pyramid, thereby extracting pixel-level "delta actions" and enabling large-scale action pretraining. Experiments show that Motus outperforms state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios (improvements of +11% to +48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
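To make the mode-switching idea concrete, the sketch below illustrates how a UniDiffuser-style scheduler can select among modeling modes by assigning independent diffusion timesteps to the video and action modalities. This is a minimal illustration under assumed conventions, not the authors' implementation; the function name, the `T_MAX` constant, and the mode labels are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code) of UniDiffuser-style mode switching:
# each modality gets its own diffusion timestep, and setting a modality's
# timestep to 0 treats it as clean conditioning while t > 0 means it is noised
# and denoised (i.e., generated).
import torch

T_MAX = 1000  # assumed number of diffusion steps

def modality_timesteps(mode: str, batch: int, device: str = "cpu"):
    """Return (t_video, t_action) for one training/inference step."""
    t = torch.randint(1, T_MAX, (batch,), device=device)
    zeros = torch.zeros(batch, dtype=torch.long, device=device)
    if mode == "world_model":   # actions given, future video generated
        return t, zeros
    if mode == "vla":           # observations given, actions generated
        return zeros, t
    if mode == "joint":         # video and actions generated together
        return t, t.clone()
    raise ValueError(f"unknown mode: {mode}")

# Usage: sampling different modes during training lets one backbone cover all
# of them; at inference the chosen timestep pattern picks the desired behavior.
t_video, t_action = modality_timesteps("vla", batch=8)
```

Under this view, an inverse dynamics model is simply the action-generation pattern conditioned on observed frames, so no separate network is needed per mode.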