UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

A generalist robot should perform effectively across various environments. However, most existing approaches rely heavily on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to a single physical specification and struggle to learn knowledge that transfers across embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of the pretraining compute and 1/10 of the downstream data. Performance continues to improve as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.
