Applying reinforcement learning (RL) to sparse reward domains is notoriously challenging due to insufficient guiding signals. Common RL techniques for addressing such domains include (1) learning from demonstrations and (2) curriculum learning. While these two approaches have been studied in detail, they have rarely been considered together. This paper aims to do so by introducing a principled task phasing approach that uses demonstrations to automatically generate a curriculum sequence. Using inverse RL on (suboptimal) demonstrations, we define a simple initial task. Our task phasing approach then provides a framework to gradually increase the complexity of the task all the way to the target task, while retraining the RL agent at each phasing iteration. Two phasing approaches are considered: (1) gradually increasing the proportion of time steps during which the RL agent is in control, and (2) phasing out a guiding informative reward function. We present conditions that guarantee the convergence of these approaches to an optimal policy. Experimental results on three sparse reward domains demonstrate that our task phasing approaches outperform state-of-the-art methods with respect to asymptotic performance.
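To make the two phasing schemes described above concrete, here is a minimal Python sketch (not the authors' implementation; `env`, `rl_agent`, `demo_policy`, and the `shaped_reward` key are hypothetical names assuming a Gym-style environment interface):

```python
import random

def run_phased_episode(env, rl_agent, demo_policy, alpha, beta):
    """Collect one episode at phase (alpha, beta).

    alpha: probability that the RL agent (rather than the demonstration-derived
           policy) acts at each time step -- phased from 0 up to 1.
    beta:  weight on the dense guiding reward recovered via inverse RL --
           phased from 1 down to 0, leaving only the sparse task reward.
    """
    obs, done, transitions = env.reset(), False, []
    while not done:
        # Scheme (1): hand control to the RL agent on a growing fraction of steps.
        actor = rl_agent if random.random() < alpha else demo_policy
        action = actor.act(obs)
        next_obs, sparse_r, done, info = env.step(action)
        # Scheme (2): blend in the informative shaped reward, phased out over time.
        reward = sparse_r + beta * info.get("shaped_reward", 0.0)
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs
    return transitions

# Example phasing loop (sketch): increase alpha / decrease beta each iteration,
# retraining the agent on data from the current phase before advancing.
# for k in range(num_phases):
#     alpha = k / (num_phases - 1)
#     beta = 1.0 - alpha
#     data = run_phased_episode(env, rl_agent, demo_policy, alpha, beta)
#     rl_agent.update(data)
```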