Hierarchical reinforcement learning has focused on discovering temporally extended actions, such as options, that can provide benefits in problems requiring extensive exploration. One promising approach that learns these options end-to-end is the option-critic (OC) framework. In this paper, we show that OC does not decompose a problem into simpler sub-problems, but instead increases the size of the search over policy space, with each option considering the entire state space during learning. This issue can result in practical limitations of the method, including sample-inefficient learning. To address this problem, we introduce Context-Specific Representation Abstraction for Deep Option Learning (CRADOL), a new framework that combines temporal abstraction with context-specific representation abstraction to effectively reduce the size of the search over policy space. Specifically, our method learns a factored belief state representation that enables each option to learn a policy over only a subsection of the state space. We test our method against hierarchical, non-hierarchical, and modular recurrent neural network baselines, demonstrating significant sample efficiency improvements in challenging partially observable environments.
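The core idea of restricting each option to a subsection of the state space can be illustrated with a minimal sketch. This is not the paper's implementation: the masks and linear policy weights below are hypothetical and fixed, whereas CRADOL learns the factored belief state representation end-to-end. The sketch only shows how a per-option mask limits which belief-state features an option's policy can depend on.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 8     # size of the (hypothetical) belief-state vector
NUM_OPTIONS = 3
NUM_ACTIONS = 4

# Hypothetical per-option binary masks: each option attends to only a
# subsection of the belief state (context-specific abstraction). In
# CRADOL these factors are learned, not sampled.
masks = (rng.random((NUM_OPTIONS, STATE_DIM)) < 0.5).astype(float)

# Hypothetical per-option linear policy weights; masking zeroes out the
# features an option ignores, shrinking its effective search space.
weights = rng.normal(size=(NUM_OPTIONS, STATE_DIM, NUM_ACTIONS))

def option_policy(option: int, belief_state: np.ndarray) -> np.ndarray:
    """Softmax action distribution for one option, computed only from
    the subsection of the belief state its mask selects."""
    masked = belief_state * masks[option]   # drop irrelevant features
    logits = masked @ weights[option]
    exp = np.exp(logits - logits.max())     # stable softmax
    return exp / exp.sum()

belief = rng.normal(size=STATE_DIM)
probs = option_policy(0, belief)
print(probs.shape)  # (4,) — one probability per action
```

Under this toy setup, two options with disjoint masks search over policies on disjoint feature subsets, which is the decomposition the abstract argues vanilla option-critic fails to provide.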