Hierarchical reinforcement learning (HRL) holds great potential for sample-efficient learning on challenging long-horizon tasks. In particular, letting a higher level assign subgoals to a lower level has been shown to enable fast learning on difficult problems. However, such subgoal-based methods have been designed with static reinforcement learning environments in mind and consequently struggle with dynamic elements beyond the immediate control of the agent, even though such elements are ubiquitous in real-world problems. In this paper, we introduce Hierarchical reinforcement learning with Timed Subgoals (HiTS), an HRL algorithm that enables the agent to adapt its timing to a dynamic environment by specifying not only what goal state is to be reached but also when. We discuss how communicating with the lower level in terms of such timed subgoals results in a more stable learning problem for the higher level. Our experiments on a range of standard benchmarks and three new challenging dynamic reinforcement learning environments show that our method is capable of sample-efficient learning where an existing state-of-the-art subgoal-based HRL method fails to learn stable solutions.
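To make the notion of a timed subgoal concrete, the following is a minimal sketch of the interface it implies: the higher level emits both a desired goal state and a time budget, and the lower level is judged only once that budget has elapsed. All names here (`TimedSubgoal`, `goal`, `delta_t`, `lower_level_reward`) are illustrative assumptions, not identifiers from the HiTS reference implementation, and the sparse 0/-1 reward shape is a common HAC-style convention assumed for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TimedSubgoal:
    """A timed subgoal: what goal state to reach, and when.

    Hypothetical structure for illustration; the higher level's
    action would consist of both fields together.
    """
    goal: np.ndarray  # desired (sub)goal state for the lower level
    delta_t: int      # remaining timesteps until the goal should be reached

def lower_level_reward(achieved_goal: np.ndarray,
                       ts: TimedSubgoal,
                       tol: float = 0.05) -> float:
    """Sparse lower-level reward, evaluated only when the time
    budget of the subgoal has run out (assumed reward shape)."""
    if ts.delta_t > 0:
        return 0.0  # budget not yet exhausted; no judgment yet
    reached = np.linalg.norm(achieved_goal - ts.goal) <= tol
    return 0.0 if reached else -1.0
```

Under this interface, the lower level's task ends at a time fixed in advance by the higher level, which is one way to see why the higher level faces a more stable learning problem: the duration of each of its actions no longer depends on the lower level's current, still-changing policy.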