Robotic arm manipulation in data-scarce settings is a highly challenging task due to complex embodiment dynamics and diverse task contexts. Recent video-based approaches have shown great promise in capturing and transferring temporal and physical interaction patterns by pre-training on Internet-scale video data. However, such methods are often not optimized for embodiment-specific closed-loop control, typically suffering from high latency and insufficient grounding. In this paper, we present Vidarc (Video Diffusion for Action Reasoning and Closed-loop Control), a novel autoregressive embodied video diffusion approach augmented by a masked inverse dynamics model. By grounding video predictions with action-relevant masks and incorporating real-time feedback through cached autoregressive generation, Vidarc achieves fast, accurate closed-loop control. Pre-trained on one million cross-embodiment episodes, Vidarc surpasses state-of-the-art baselines, achieving at least a 15% higher success rate in real-world deployment and a 91% reduction in latency. We also highlight its robust generalization and error-correction capabilities across previously unseen robotic platforms.
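The abstract describes grounding video predictions with action-relevant masks and feeding real observations back through cached autoregressive generation. The sketch below illustrates that closed-loop pattern under stated assumptions: the module names (AutoregressiveVideoPredictor, MaskedInverseDynamics), tensor shapes, and the all-ones placeholder mask are illustrative, not the paper's implementation, and the diffusion denoiser is replaced by a single linear step so the example stays runnable.

```python
# Minimal, hypothetical sketch of the closed-loop control pattern described above:
# an autoregressive video predictor proposes the next frame, a masked inverse
# dynamics model (IDM) recovers an action from the (current, predicted) frame pair,
# and the real observation returned by the environment closes the loop each step.
import torch
import torch.nn as nn


class MaskedInverseDynamics(nn.Module):
    """Predicts an action from a (current, predicted-next) frame pair, restricted
    to action-relevant regions by a binary mask (assumed interface)."""

    def __init__(self, frame_dim: int = 64 * 64, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, frame_t, frame_next, mask):
        # Zero out pixels outside the action-relevant mask before inferring the action.
        x = torch.cat(
            [(frame_t * mask).flatten(1), (frame_next * mask).flatten(1)], dim=-1
        )
        return self.net(x)


class AutoregressiveVideoPredictor(nn.Module):
    """Stand-in for the video diffusion backbone: maps the current frame to a
    predicted next frame while appending to a reusable context cache. A real
    model would run iterative denoising conditioned on the cached context."""

    def __init__(self, frame_dim: int = 64 * 64):
        super().__init__()
        self.step = nn.Linear(frame_dim, frame_dim)

    def forward(self, frame_t, cache):
        pred = self.step(frame_t.flatten(1)).view_as(frame_t)
        cache.append(pred.detach())  # cached prediction reused by later steps
        return pred, cache


def closed_loop_episode(env_step, first_obs, steps: int = 10):
    """env_step(action) -> next observation tensor; returns executed actions."""
    predictor, idm = AutoregressiveVideoPredictor(), MaskedInverseDynamics()
    cache, obs, actions = [], first_obs, []
    mask = torch.ones_like(first_obs)  # placeholder for an action-relevant mask
    for _ in range(steps):
        pred_next, cache = predictor(obs, cache)  # predict the next frame
        action = idm(obs, pred_next, mask)        # ground the prediction into an action
        obs = env_step(action)                    # real-time feedback closes the loop
        actions.append(action)
    return actions
```

In this sketch the cache is only appended to; in a cached autoregressive generator it would condition subsequent predictions so that each control step amortizes earlier computation, which is the mechanism the abstract credits for the latency reduction.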