A long-standing goal in robotics is to build robots that can perform a wide range of daily tasks specified only via natural language, using perception from their onboard sensors. Although language-driven robotics has recently advanced substantially by leveraging end-to-end learning from pixels, there is no clear and well-understood process for making the various design choices, because the underlying setups vary. In this paper, we conduct an extensive study of the most critical challenges in learning language-conditioned policies from offline free-form imitation datasets. We further identify architectural and algorithmic techniques that improve performance, such as a hierarchical decomposition of the robot control learning, a multimodal transformer encoder, discrete latent plans, and a self-supervised contrastive loss that aligns video and language representations. By combining the results of our investigation with our improved model components, we present a novel approach that significantly outperforms the state of the art on CALVIN, a challenging benchmark for language-conditioned long-horizon robot manipulation. We have open-sourced our implementation to facilitate future research on learning to perform many complex manipulation skills in a row, specified with natural language. The codebase and trained models are available at http://hulc.cs.uni-freiburg.de
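
The abstract names a self-supervised contrastive loss that aligns video and language representations but does not give its form. As a minimal sketch, assuming a symmetric InfoNCE-style objective over a batch of paired embeddings, the idea could look like the following. The function names, the temperature value, and the use of NumPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_alignment_loss(video_emb, lang_emb, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss for paired embeddings.

    video_emb, lang_emb: arrays of shape (batch, dim), where row i of each
    array comes from the same demonstration (an assumed pairing).
    """
    # L2-normalize so the dot product is a cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    l = lang_emb / np.linalg.norm(lang_emb, axis=1, keepdims=True)

    # Pairwise similarities; matching pairs sit on the diagonal.
    logits = v @ l.T / temperature
    idx = np.arange(len(v))

    # Cross-entropy in both directions: video-to-language and language-to-video.
    loss_v2l = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_l2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2l + loss_l2v)
```

Under these assumptions, the loss is small when each video embedding is closest to its own instruction embedding and grows when the pairing is broken, which is the alignment pressure the abstract describes.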