For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test time are not known in advance, the agent performs a self-supervised "practice" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques.
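The three uses of the learned representation, and the retroactive relabeling scheme, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `encode` function stands in for a trained VAE encoder (here a fixed random projection so the sketch stays runnable), goals are sampled from a unit-Gaussian prior, and the reward is negative distance in latent space. All names (`encode`, `sample_goal`, `reward`, `relabel`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, IMG_DIM = 4, 16

# Stand-in for a trained VAE encoder (assumption): a real system would
# map raw images to latents with a learned network.
W = rng.normal(size=(LATENT_DIM, IMG_DIM))

def encode(image):
    """Structured transformation of a raw sensory input into latent space."""
    return W @ image

def sample_goal():
    """Sample an imagined goal for self-supervised practice from the prior N(0, I)."""
    return rng.normal(size=LATENT_DIM)

def reward(z, z_goal):
    """Goal-reaching reward: negative Euclidean distance in latent space."""
    return -np.linalg.norm(z - z_goal)

def relabel(transition, achieved_image):
    """Retroactive relabeling: substitute an achieved outcome as the goal
    and recompute the reward, reusing the transition off-policy."""
    z_new_goal = encode(achieved_image)
    return {**transition, "z_goal": z_new_goal,
            "r": reward(transition["z"], z_new_goal)}

# One self-supervised practice step: imagine a goal, observe, score progress.
z_goal = sample_goal()
obs = rng.normal(size=IMG_DIM)          # placeholder for an image observation
r = reward(encode(obs), z_goal)         # reward is 0 exactly at the goal, else < 0
```

The key property this sketch illustrates is that goal sampling, state encoding, and reward computation all share one latent space, so no hand-designed reward or goal distribution over raw pixels is needed.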