One of the key challenges in visual imitation learning is collecting large amounts of expert demonstrations for a given task. While collecting human demonstrations is becoming easier with teleoperation and low-cost assistive tools, we often still require 100-1000 demonstrations per task to learn a visual representation and policy. To address this, we turn to an alternative form of data that does not require task-specific demonstrations -- play. Play is a fundamental means by which children acquire skills, behaviors, and visual representations in early learning. Importantly, play data is diverse, task-agnostic, and relatively cheap to obtain. In this work, we propose to use playful interactions in a self-supervised manner to learn visual representations for downstream tasks. We collect 2 hours of playful data in 19 diverse environments and use self-predictive learning to extract visual representations. Given these representations, we train policies using imitation learning for two downstream tasks: Pushing and Stacking. We demonstrate that our visual representations generalize better than standard behavior cloning and can achieve similar performance with only half the number of required demonstrations. Our representations, which are trained from scratch, compare favorably against ImageNet-pretrained representations. Finally, we provide an experimental analysis of the effects of different pretraining modes on downstream task learning.