Goal-conditioned hierarchical reinforcement learning (HRL) is a successful approach to solving complex, temporally extended tasks. Recently, its success has been extended to more general settings by concurrently learning hierarchical policies and subgoal representations. However, online subgoal representation learning exacerbates the non-stationarity of HRL and introduces challenges for exploration in high-level policy learning. In this paper, we propose a state-specific regularization that stabilizes subgoal embeddings in well-explored areas while allowing representation updates in less explored state regions. Benefiting from this stable representation, we design measures of novelty and potential for subgoals, and develop an efficient hierarchical exploration strategy that actively seeks out new promising subgoals and states. Experimental results show that our method significantly outperforms state-of-the-art baselines on continuous control tasks with sparse rewards, and further demonstrate the stability and efficiency of our subgoal representation learning, which in turn promotes superior policy learning.
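To make the two ideas in the abstract concrete, the sketch below illustrates one plausible instantiation: a state-specific regularization that anchors the subgoal encoder to a frozen copy of itself in well-visited regions, and a count-based novelty score over discretized subgoal embeddings. The abstract does not specify these details; the use of PyTorch, the visitation-count table, the discretization cell size, and the saturating weight are all assumptions for illustration only.

```python
# Minimal sketch (assumptions noted above, not the paper's actual implementation).
from collections import defaultdict

import torch
import torch.nn as nn

class SubgoalEncoder(nn.Module):
    """Maps raw states to a low-dimensional subgoal embedding phi(s)."""
    def __init__(self, state_dim: int, subgoal_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, subgoal_dim),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

# Hypothetical visitation-count table over discretized embeddings; in practice it
# would be incremented as states are visited during training.
visit_counts: dict = defaultdict(int)

def discretize(z: torch.Tensor, cell: float = 0.5) -> tuple:
    """Map an embedding to a hashable grid cell (assumed cell size)."""
    return tuple((z / cell).floor().int().tolist())

def stability_loss(encoder: SubgoalEncoder,
                   frozen_encoder: SubgoalEncoder,
                   states: torch.Tensor) -> torch.Tensor:
    """State-specific regularization: embeddings of well-visited states are anchored
    to a frozen copy of the encoder; rarely visited states remain free to move."""
    z_new = encoder(states)
    with torch.no_grad():
        z_old = frozen_encoder(states)
    # Weight each state by how often its (old) embedding cell has been visited.
    counts = torch.tensor(
        [visit_counts[discretize(z)] for z in z_old], dtype=torch.float32
    )
    weights = counts / (1.0 + counts)  # saturates toward 1 in well-explored regions
    return (weights * (z_new - z_old).pow(2).sum(dim=-1)).mean()

def novelty(z: torch.Tensor) -> float:
    """Count-based novelty of a candidate subgoal: rarer embedding cells score higher."""
    return 1.0 / (1.0 + visit_counts[discretize(z)])
```

In this reading, the stability loss would be added to the representation-learning objective, so that frequently visited embeddings stay fixed while unexplored regions can still be reshaped, and the novelty score would feed into the high-level policy's choice of promising subgoals.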