Scaling general-purpose manipulation to new robot embodiments remains challenging: each platform typically requires large, homogeneous demonstration datasets, and end-to-end pixel-to-action pipelines often degrade under background and viewpoint shifts. Building on recent advances in video-based robot control, we present Vidar, which consists of an embodied video diffusion model as a generalizable prior and a masked inverse dynamics model (MIDM) as an adapter. We take a video diffusion model pre-trained at Internet scale and continue pre-training it on the embodied domain using 750K multi-view trajectories collected from three real-world robot platforms. For this embodied pre-training, we introduce a unified observation space that jointly encodes robot, camera, task, and scene contexts. The MIDM module learns action-relevant pixel masks without dense labels, grounding the prior in the target embodiment's action space while suppressing distractors. With only 20 minutes of human demonstrations on an unseen robot (about 1% of the data typically required), Vidar outperforms state-of-the-art baselines and generalizes to unseen tasks, backgrounds, and camera layouts. Our results suggest a scalable recipe for "one prior, many embodiments": a strong, inexpensive video prior combined with minimal on-robot alignment.
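To make the MIDM idea concrete, the following is a minimal sketch, not the authors' implementation: a mask head predicts a soft pixel mask over a consecutive frame pair, and an inverse dynamics head regresses the action from the masked frames. The mask receives no dense supervision; it is shaped only by the action-regression loss plus an assumed sparsity penalty. All layer sizes, the 7-DoF action dimension, and the sparsity weight are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskedInverseDynamics(nn.Module):
    """Sketch of a masked inverse dynamics model (hypothetical architecture)."""

    def __init__(self, action_dim: int = 7):
        super().__init__()
        # Mask head: frame pair (6 channels) -> 1-channel soft mask in [0, 1].
        self.mask_head = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )
        # Inverse dynamics encoder on the masked frame pair.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.action_head = nn.Linear(64, action_dim)

    def forward(self, frame_t, frame_t1):
        pair = torch.cat([frame_t, frame_t1], dim=1)   # (B, 6, H, W)
        mask = self.mask_head(pair)                    # (B, 1, H, W) soft mask
        masked_pair = pair * mask                      # suppress distractor pixels
        action = self.action_head(self.encoder(masked_pair))
        return action, mask

# Training signal: action regression plus a mask-sparsity term (assumed weight),
# so action-relevant pixels are selected without any pixel-level labels.
model = MaskedInverseDynamics()
frames_t = torch.randn(4, 3, 128, 128)
frames_t1 = torch.randn(4, 3, 128, 128)
true_actions = torch.randn(4, 7)
pred_actions, mask = model(frames_t, frames_t1)
loss = nn.functional.mse_loss(pred_actions, true_actions) + 1e-3 * mask.mean()
loss.backward()
```

At deployment, the video diffusion prior would predict future frames for the target embodiment, and a module of this form would translate consecutive predicted frames into executable actions; the exact interface between the two is not specified here and is left as an assumption of this sketch.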