Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models

In recent years, much progress has been made in learning robotic manipulation policies that follow natural language instructions. Such methods typically learn from corpora of robot-language data that was either collected with specific tasks in mind or expensively re-labelled by humans with rich language descriptions in hindsight. Recently, large-scale pretrained vision-language models (VLMs) like CLIP or ViLD have been applied to robotics for learning representations and scene descriptors. Can these pretrained models serve as automatic labelers for robot data, effectively importing Internet-scale knowledge into existing datasets to make them useful even for tasks that are not reflected in their ground truth annotations? To accomplish this, we introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL): we utilize semi-supervised language labels leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data and then train language-conditioned policies on the augmented datasets. This method enables cheaper acquisition of useful language descriptions compared to expensive human labels, allowing for more efficient label coverage of large-scale datasets. We apply DIAL to a challenging real-world robotic manipulation domain where 96.5% of the 80,000 demonstrations do not contain crowd-sourced language annotations. DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
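The relabeling step described above can be illustrated with a minimal sketch. It assumes that each demonstration and each candidate instruction has already been embedded into a shared space by a CLIP-style model, and that candidates are ranked by cosine similarity with the top `k` kept as new labels. The abstract does not specify the scoring or selection rule, so the function `augment_instructions`, the parameter `k`, and the toy embeddings are hypothetical and used only for illustration. The last lines work out how many of the 80,000 demonstrations lack crowd-sourced annotations.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def augment_instructions(episode_emb, text_emb, instructions, k=2):
    """Hypothetical helper: for each episode, return the k candidate
    instructions with the highest cosine similarity to its embedding."""
    sims = l2_normalize(episode_emb) @ l2_normalize(text_emb).T  # (episodes, instructions)
    top = np.argsort(-sims, axis=1)[:, :k]  # indices of the k best candidates per episode
    return [
        [(instructions[j], float(sims[i, j])) for j in row]
        for i, row in enumerate(top)
    ]


# Toy stand-ins for embeddings (assumption: a real pipeline would obtain
# these from a vision-language model's image and text encoders).
instructions = ["pick up the can", "open the drawer", "move the sponge left"]
text_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
episode_emb = np.array([[0.9, 0.1, 0.3], [0.1, 0.2, 0.8]])

labels = augment_instructions(episode_emb, text_emb, instructions, k=2)
for i, ranked in enumerate(labels):
    print(f"episode {i}: {[text for text, _ in ranked]}")

# Dataset arithmetic from the abstract: 96.5% of 80,000 demonstrations
# have no crowd-sourced language annotation.
n_total = 80_000
n_unlabelled = round(0.965 * n_total)
n_labelled = n_total - n_unlabelled
print(n_unlabelled, n_labelled)
```

Under these assumptions, a language-conditioned policy would then be trained on the original demonstrations paired with the augmented instructions. Keeping several candidates per episode is one way to describe the same behavior in more than one phrasing.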
