Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
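To make the decoder design concrete, the sketch below shows one way a flow matching-based action decoder conditioned on frozen video-model latents could be trained. It is a minimal illustration, not the paper's actual architecture: the module names, dimensions, the simple MLP backbone, and the pooled-latent conditioning are all assumptions introduced here for clarity.

```python
# Minimal, hypothetical sketch: a flow-matching action decoder (IDM) conditioned
# on latents from a pretrained video model. Names/shapes are illustrative only.
import torch
import torch.nn as nn


class FlowMatchingActionDecoder(nn.Module):
    """Predicts a velocity field over an action chunk, given a video latent."""

    def __init__(self, latent_dim=1024, action_dim=7, horizon=16, hidden=512):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        # Inputs: pooled video latent + flattened noisy action chunk + time scalar.
        in_dim = latent_dim + horizon * action_dim + 1
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, video_latent, noisy_actions, t):
        # video_latent: (B, latent_dim), noisy_actions: (B, horizon, action_dim), t: (B,)
        x = torch.cat([video_latent, noisy_actions.flatten(1), t[:, None]], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)


def flow_matching_loss(decoder, video_latent, expert_actions):
    """Conditional flow matching: regress the straight-line velocity a1 - a0."""
    a1 = expert_actions                       # expert action chunk (target)
    a0 = torch.randn_like(a1)                 # noise sample
    t = torch.rand(a1.shape[0], device=a1.device)
    a_t = (1 - t)[:, None, None] * a0 + t[:, None, None] * a1  # linear interpolant
    v_pred = decoder(video_latent, a_t, t)
    return ((v_pred - (a1 - a0)) ** 2).mean()


# Usage: latents would come from a frozen, pretrained video model (not shown).
decoder = FlowMatchingActionDecoder()
latents = torch.randn(8, 1024)                # stand-in for video-model latents
actions = torch.randn(8, 16, 7)               # expert action chunks from demonstrations
loss = flow_matching_loss(decoder, latents, actions)
loss.backward()
```

At inference, actions would be obtained by integrating the learned velocity field from noise to an action chunk (e.g., with a few Euler steps), conditioned on the latent of the video model's action plan.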