Across machine learning, the use of curricula has shown strong empirical potential to improve learning from data by avoiding poor local optima of training objectives. For reinforcement learning (RL), curricula are especially interesting, as the underlying optimization has a strong tendency to get stuck in local optima due to the exploration-exploitation trade-off. Recently, a number of approaches for automatically generating curricula for RL have been shown to increase performance while requiring less expert knowledge than manually designed curricula. However, these approaches are seldom investigated from a theoretical perspective, preventing a deeper understanding of their mechanics. In this paper, we present an approach for automated curriculum generation in RL with a clear theoretical underpinning. More precisely, we formalize the well-known self-paced learning paradigm as inducing a distribution over training tasks that trades off task complexity against the objective of matching a desired task distribution. Experiments show that training on this induced distribution helps to avoid poor local optima across RL algorithms in different tasks with uninformative rewards and challenging exploration requirements.
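One way to read this trade-off is as an optimization over the training-task distribution itself. As a minimal illustrative sketch (the notation $p(c \mid \nu)$ for the induced task distribution, $J(\pi, c)$ for the expected return of policy $\pi$ on task $c$, $\mu(c)$ for the desired task distribution, and the weight $\alpha$ are assumptions for exposition, not necessarily the paper's exact formulation), such an objective could take the form
\[
  \max_{\nu} \; \mathbb{E}_{c \sim p(c \mid \nu)}\!\left[ J(\pi, c) \right]
  \;-\; \alpha \, D_{\mathrm{KL}}\!\left( p(c \mid \nu) \,\middle\|\, \mu(c) \right),
\]
where the first term favors tasks the current policy already handles well (low effective complexity) and the second term, weighted by $\alpha \ge 0$, pulls the induced distribution toward the desired task distribution.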