Hierarchical methods in reinforcement learning have the potential to reduce the number of decisions an agent needs to make when learning new tasks. However, finding useful, reusable temporal abstractions that facilitate fast learning remains a challenging problem. Recently, several deep learning approaches have been proposed to learn such temporal abstractions, in the form of options, in an end-to-end manner. In this work, we point out several shortcomings of these methods and discuss their potential negative consequences. Subsequently, we formulate the desiderata for reusable options and use these to frame the problem of learning options as a gradient-based meta-learning problem. This allows us to formulate an objective that explicitly incentivizes options which allow a higher-level decision maker to adjust to different tasks in few steps. Experimentally, we show that our method learns transferable components that accelerate learning, and that it outperforms prior methods developed for this setting. Additionally, we perform ablations to quantify the impact of using gradient-based meta-learning as well as of the other proposed changes.
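To make the meta-learning formulation concrete, the following is a minimal sketch, not the authors' code, of a MAML-style objective under the assumptions the abstract suggests: option parameters theta are shared across tasks, a high-level policy phi adapts to each task with a few inner gradient steps while the options stay fixed, and theta is trained on post-adaptation performance. The function task_loss is a hypothetical toy quadratic standing in for the negative expected return of the hierarchical policy, which in practice would be estimated from rollouts.

```python
import torch

def task_loss(theta, phi, task):
    # Toy stand-in for the negative expected return of the hierarchical
    # policy (options theta, high-level policy phi) on `task`; the real
    # objective would be estimated from environment rollouts.
    return ((theta + phi - task) ** 2).sum()

def meta_objective(theta, phi_init, tasks, inner_lr=0.1, inner_steps=3):
    meta_loss = 0.0
    for task in tasks:
        phi = phi_init
        for _ in range(inner_steps):
            # Inner loop: adapt only the high-level policy to the task.
            # create_graph=True keeps the adaptation differentiable so the
            # outer gradient can flow back into the option parameters theta.
            g, = torch.autograd.grad(task_loss(theta, phi, task),
                                     phi, create_graph=True)
            phi = phi - inner_lr * g
        # Outer loss: performance *after* few-step adaptation. Minimizing it
        # with respect to theta rewards options that make adaptation fast.
        meta_loss = meta_loss + task_loss(theta, phi, task)
    return meta_loss / len(tasks)

# Hypothetical setup: 4-dimensional parameters, 8 random "tasks".
theta = torch.zeros(4, requires_grad=True)      # shared option parameters
phi_init = torch.zeros(4, requires_grad=True)   # high-level policy init
tasks = [torch.randn(4) for _ in range(8)]
opt = torch.optim.SGD([theta, phi_init], lr=0.01)
for step in range(100):
    opt.zero_grad()
    meta_objective(theta, phi_init, tasks).backward()
    opt.step()
```

The key design choice this illustrates is that the options are optimized only through the outer, post-adaptation loss, so the objective explicitly selects for temporal abstractions that let the higher-level decision maker adjust to a new task in few gradient steps.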