Goal-conditioned hierarchical reinforcement learning (GCHRL) provides a promising approach to solving long-horizon tasks. Recently, its success has been extended to more general settings by concurrently learning hierarchical policies and subgoal representations. Although GCHRL possesses superior exploration ability by decomposing tasks via subgoals, existing GCHRL methods struggle in temporally extended tasks with sparse external rewards, since the high-level policy learning relies on external rewards. As the high-level policy selects subgoals in an online learned representation space, the dynamic change of the subgoal space severely hinders effective high-level exploration. In this paper, we propose a novel regularization that contributes to both stable and efficient subgoal representation learning. Building upon the stable representation, we design measures of novelty and potential for subgoals, and develop an active hierarchical exploration strategy that seeks out new promising subgoals and states without intrinsic rewards. Experimental results show that our approach significantly outperforms state-of-the-art baselines in continuous control tasks with sparse rewards.