Temporal abstraction in reinforcement learning (RL) offers the promise of improving generalization and knowledge transfer in complex environments by propagating information more efficiently over time. Although option learning was initially formulated in a way that allows updating many options simultaneously, using off-policy, intra-option learning (Sutton, Precup & Singh, 1999), many recent hierarchical reinforcement learning approaches update only a single option at a time: the option currently executing. We revisit and extend intra-option learning in the context of deep reinforcement learning, in order to enable updating all options consistent with the current primitive action choices, without introducing any additional estimates. Our method can therefore be naturally adopted in most hierarchical RL frameworks. When we combine our approach with the option-critic algorithm for option discovery, we obtain significant improvements in performance and data efficiency across a wide variety of domains.
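To make the mechanism concrete, below is a minimal tabular sketch of intra-option Q-learning in the spirit of Sutton, Precup & Singh (1999), assuming deterministic intra-option policies: every option that would have emitted the executed primitive action is updated from the same transition, not only the option that generated it. The function and array names (q_u, q_omega, option_policy, beta) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def intra_option_q_update(q_u, q_omega, option_policy, beta, s, a, r, s_next,
                          gamma=0.99, lr=0.1):
    """One tabular intra-option Q-learning step (Sutton, Precup & Singh, 1999).

    q_u           : [n_states, n_options, n_actions] value of taking `a` in `s` under omega
    q_omega       : [n_states, n_options]            value of executing omega from `s`
    option_policy : [n_states, n_options] int        action each option takes in each state
    beta          : [n_states, n_options]            termination probability of omega in `s`
    """
    n_options = q_omega.shape[1]
    for omega in range(n_options):
        # Update only the options consistent with the observed primitive action.
        if option_policy[s, omega] != a:
            continue
        # U(omega, s'): either omega continues, or it terminates and we switch greedily.
        u_next = (1.0 - beta[s_next, omega]) * q_omega[s_next, omega] \
                 + beta[s_next, omega] * q_omega[s_next].max()
        td_target = r + gamma * u_next
        q_u[s, omega, a] += lr * (td_target - q_u[s, omega, a])
        # Keep Q_Omega consistent with Q_U under the deterministic intra-option policy.
        q_omega[s, omega] = q_u[s, omega, a]
    return q_u, q_omega
```

Because the update uses only quantities already maintained by standard option-value learning, it adds no extra estimators, which is what allows it to slot into existing hierarchical RL training loops.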