Recent work has shown that Large Language Models (LLMs) can be applied to ground natural language to a wide variety of robot skills. However, in practice, learning multi-task, language-conditioned robotic skills typically requires large-scale data collection and frequent human intervention to reset the environment or help correct the current policies. In this work, we propose a novel approach to efficiently learn general-purpose language-conditioned robot skills from unstructured, offline, and reset-free data in the real world by exploiting a self-supervised visuo-lingual affordance model, which requires annotating as little as 1% of the total data with language. We evaluate our method in extensive experiments on both simulated and real-world robotic tasks, achieving state-of-the-art performance on the challenging CALVIN benchmark and learning over 25 distinct visuomotor manipulation tasks with a single policy in the real world. We find that, when paired with LLMs to break down abstract natural language instructions into subgoals via few-shot prompting, our method is capable of completing long-horizon, multi-tier tasks in the real world while requiring an order of magnitude less data than previous approaches. Code and videos are available at http://hulc2.cs.uni-freiburg.de