Reinforcement learning (RL) agents can learn to solve complex sequential decision-making tasks by interacting with an environment; however, sample efficiency remains a major challenge. It is especially hard to improve in multi-goal RL, where agents must reach multiple goals to solve complex tasks. In contrast, humans and other biological agents learn such tasks far more strategically, following a curriculum in which tasks are sampled at gradually increasing difficulty so as to make steady and efficient learning progress. In this work, we propose a method for automatic goal generation using a dynamical distance function (DDF) trained in a self-supervised fashion. The DDF predicts the dynamical distance, i.e., the expected number of time steps, between any two states of a Markov decision process (MDP). Using this distance, we generate a curriculum of goals at an appropriate difficulty level throughout training, facilitating efficient learning. We evaluate this approach on several goal-conditioned robotic manipulation and navigation tasks and show that it improves sample efficiency over a baseline method that relies solely on random goal sampling.
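To make the abstract's two ingredients concrete, below is a minimal sketch, not the paper's implementation, of (1) fitting a DDF by self-supervised regression on step counts between states visited within the same trajectory, and (2) using the learned DDF to select curriculum goals whose predicted distance from the start state falls in an intermediate-difficulty band. All names and parameters here (DDFNet, sample_curriculum_goal, the band thresholds d_lo and d_hi) are illustrative assumptions, not from the paper.

```python
# Sketch of DDF training and distance-band curriculum goal selection.
# Assumes flat state vectors and trajectories stored as (T, state_dim) tensors.
import random
import torch
import torch.nn as nn

class DDFNet(nn.Module):
    """Predicts the dynamical distance (steps) from state s to goal state g."""
    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, g], dim=-1)).squeeze(-1)

def ddf_loss(ddf: DDFNet, trajectory: torch.Tensor) -> torch.Tensor:
    """Self-supervised labels: for states s_i, s_j on one trajectory with
    i <= j, the distance target is simply j - i (steps actually taken)."""
    T = trajectory.shape[0]
    i = torch.randint(0, T, (256,))
    j = torch.minimum(i + torch.randint(0, T, (256,)), torch.tensor(T - 1))
    pred = ddf(trajectory[i], trajectory[j])
    return nn.functional.mse_loss(pred, (j - i).float())

def sample_curriculum_goal(ddf: DDFNet, start: torch.Tensor,
                           candidates: torch.Tensor,
                           d_lo: float = 10.0, d_hi: float = 30.0):
    """Keep candidate goals whose predicted distance from the start state
    lies in an intermediate band: reachable, but not trivially close."""
    with torch.no_grad():
        d = ddf(start.expand(len(candidates), -1), candidates)
    mask = (d >= d_lo) & (d <= d_hi)
    pool = candidates[mask] if mask.any() else candidates
    return pool[random.randrange(len(pool))]
```

The distance band acts as the difficulty knob: widening or shifting [d_lo, d_hi] as the agent improves yields goals of gradually increasing difficulty, which is the curriculum behavior the abstract describes.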