Complex sequential tasks in continuous-control settings often require agents to successfully traverse a set of "narrow passages" in their state space. Solving such tasks with a sparse reward in a sample-efficient manner poses a challenge to modern reinforcement learning (RL) due to the long-horizon nature of the problem and the lack of sufficient positive signal during learning. Various tools have been applied to address this challenge. When available, large sets of demonstrations can guide agent exploration. Hindsight relabelling, on the other hand, does not require additional sources of information. However, existing strategies explore based on task-agnostic goal distributions, which can render the solution of long-horizon tasks impractical. In this work, we extend hindsight relabelling mechanisms to guide exploration along task-specific distributions implied by a small set of successful demonstrations. We evaluate the approach on four complex single- and dual-arm robotic manipulation tasks against strong, suitable baselines. The method requires far fewer demonstrations to solve all tasks and achieves significantly higher overall performance as task complexity increases. Finally, we investigate the robustness of the proposed solution with respect to the quality of input representations and the number of demonstrations.
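To make the core idea concrete, the following is a minimal illustrative sketch of demonstration-guided hindsight relabelling, written under our own assumptions rather than from the paper's implementation: instead of relabelling transitions with task-agnostic goals (e.g., the "future" strategy of vanilla HER), relabel goals are drawn from states achieved along successful demonstrations. All names here (Transition, demo_states, reached_goal, relabel_with_demo_goals) are hypothetical.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical transition tuple: (state, action, reward, next_state, goal).
Transition = Tuple[list, list, float, list, list]

def relabel_with_demo_goals(
    episode: List[Transition],
    demo_states: List[list],                      # states visited by successful demos
    reached_goal: Callable[[list, list], bool],   # did next_state achieve goal g?
    k: int = 4,                                   # relabelled copies per transition
) -> List[Transition]:
    """Relabel each transition's goal with samples from the task-specific
    distribution implied by the demonstrations, rather than the
    task-agnostic goal distribution used by standard hindsight relabelling."""
    relabelled = []
    for (s, a, _, s_next, _) in episode:
        for _ in range(k):
            g = random.choice(demo_states)                    # task-specific goal
            r = 1.0 if reached_goal(s_next, g) else 0.0       # sparse reward
            relabelled.append((s, a, r, s_next, g))
    return relabelled
```

Because the demonstrations traverse the task's "narrow passages", goals sampled this way concentrate the sparse learning signal along the states an agent must actually pass through, which is the intuition behind the approach summarized above.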