Hierarchical methods in reinforcement learning have the potential to reduce the number of decisions an agent needs to make when learning new tasks. However, finding reusable temporal abstractions that facilitate fast learning remains a challenging problem. Recently, several deep learning approaches have been proposed to learn such temporal abstractions, in the form of options, in an end-to-end manner. In this work, we point out several shortcomings of these methods and discuss their potential negative consequences. We then formulate desiderata for reusable options and use them to frame the problem of learning options as a gradient-based meta-learning problem. This allows us to formulate an objective that explicitly incentivizes options which allow a higher-level decision maker to adapt to different tasks in few steps. Experimentally, we show that our method learns transferable components that accelerate learning and outperforms prior methods developed for this setting. Additionally, we perform ablations to quantify the impact of using gradient-based meta-learning as well as the other proposed changes.
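To make the bi-level structure of the objective concrete, the following is a minimal sketch of gradient-based meta-learning over option parameters, in the spirit of MAML-style differentiation through inner updates. It is not the paper's implementation: the toy surrogate loss, the dimensions, and all names (`task_loss`, `master_w`, `options`) are illustrative assumptions. The options are the meta-learned parameters; a fresh higher-level (master) policy is adapted per task with a few differentiable gradient steps, and the post-adaptation loss is backpropagated into the options.

```python
import torch

# Hypothetical dimensions for a toy illustration (not from the paper).
obs_dim, n_options, act_dim = 4, 3, 2
inner_lr, outer_lr, inner_steps = 0.1, 1e-2, 3

# Meta-learned option parameters: one linear "option policy" per option.
options = torch.randn(n_options, obs_dim, act_dim, requires_grad=True)
meta_opt = torch.optim.Adam([options], lr=outer_lr)

def task_loss(master_w, options, task):
    """Toy surrogate for per-task performance: the master policy mixes
    option outputs, penalized by distance to the task's target action
    (stands in for the true RL objective)."""
    obs, target = task
    mix = torch.softmax(obs @ master_w, dim=-1)              # (batch, n_options)
    acts = torch.einsum('bk,koa,bo->ba', mix, options, obs)  # mixture of option outputs
    return ((acts - target) ** 2).mean()

for meta_iter in range(200):
    # Sample a synthetic "task": random observations and target actions.
    task = (torch.randn(32, obs_dim), torch.randn(32, act_dim))
    # Fresh higher-level (master) policy for each task.
    master_w = torch.zeros(obs_dim, n_options, requires_grad=True)
    # Inner loop: adapt only the master policy, keeping the graph
    # (create_graph=True) so gradients flow back into the options.
    for _ in range(inner_steps):
        loss = task_loss(master_w, options, task)
        (g,) = torch.autograd.grad(loss, master_w, create_graph=True)
        master_w = master_w - inner_lr * g
    # Outer loop: update the options so few-step adaptation succeeds.
    meta_opt.zero_grad()
    task_loss(master_w, options, task).backward()
    meta_opt.step()
```

The key design point this sketch captures is that the options are never updated directly on any single task; their gradient comes only from how well the higher-level policy performs after a small, fixed number of adaptation steps, which is what incentivizes reusability across tasks.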