Efficient Hierarchical Exploration with Stable Subgoal Representation Learning

Goal-conditioned hierarchical reinforcement learning (HRL) is a successful approach to solving complex, temporally extended tasks. Recently, its success has been extended to more general settings by concurrently learning hierarchical policies and subgoal representations. However, online subgoal representation learning exacerbates the non-stationarity of HRL and makes exploration harder in high-level policy learning. In this paper, we propose a state-specific regularization that stabilizes subgoal embeddings in well-explored areas while allowing representation updates in less explored state regions. Building on this stable representation, we design measures of novelty and potential for subgoals, and develop an efficient hierarchical exploration strategy that actively seeks out new, promising subgoals and states. Experimental results show that our method significantly outperforms state-of-the-art baselines on continuous control tasks with sparse rewards. They further demonstrate the stability and efficiency of the proposed subgoal representation learning, which in turn promotes better policy learning.
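
The idea of a state-specific regularization can be illustrated with a small sketch. This is not the paper's formulation: the weighting function, the `scale` parameter, and the name `stability_penalty` are assumptions made for illustration only. The sketch penalizes drift between old and new embeddings, with a weight that grows with how often a state has been visited.

```python
import math


def stability_penalty(new_emb, old_emb, visit_counts, scale=1.0):
    """Mean weighted squared drift between old and new embeddings.

    Assumption (not from the paper): the per-state weight is
    w = 1 - exp(-count / scale), so rarely visited states are almost
    free to move while frequently visited states are held in place.
    """
    total = 0.0
    for new, old, count in zip(new_emb, old_emb, visit_counts):
        weight = 1.0 - math.exp(-count / scale)
        # Squared Euclidean distance between the two embeddings of one state.
        drift = sum((a - b) ** 2 for a, b in zip(new, old))
        total += weight * drift
    return total / len(new_emb)


# Two states whose embeddings moved by the same amount.
old = [[0.0, 0.0], [0.0, 0.0]]
new = [[1.0, 0.0], [1.0, 0.0]]
# The first state was never visited, the second was visited 100 times,
# so only the second state's drift contributes to the penalty.
penalty = stability_penalty(new, old, visit_counts=[0, 100])
```

In a real training loop a term of this kind would be added to the representation loss, and the visit statistics would have to come from the agent's own experience.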

Related content

Representation learning

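Representation learning uses training data to learn vector representations, which overcomes the limitations of hand-crafted methods. It is usually divided into two broad categories: unsupervised and supervised representation learning. Most unsupervised methods use the latent variables of an autoencoder (such as a denoising autoencoder or a sparse autoencoder) as the representation. The more recent variational autoencoders tolerate noise and outliers better. However, exactly inferring the latent structure of given data is almost impossible, so approximate inference strategies are used instead. In addition, some unsupervised methods aim to approximate a specific similarity measure. One proposed unsupervised, similarity-preserving framework uses matrix factorization to preserve pairwise DTW (dynamic time warping) similarities. By learning DTW-preserving shapelets, the Euclidean distance in the transformed space approximates the true DTW distance between the original data. Supervised representation learning can exploit label information to better capture the semantic structure of the data. Siamese networks and triplet networks are currently two popular models; their objective is to maximize the distance between classes while minimizing the distance within each class.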
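
The objective of a triplet network can be made concrete with a minimal sketch of the standard margin-based triplet loss. The function names, the squared Euclidean distance, and the margin value are choices made here for illustration; the text above does not specify them.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on one (anchor, positive, negative) triple.

    The loss is zero once the negative (different class) is farther from
    the anchor than the positive (same class) by at least `margin`.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]
same_class = [0.0, 1.0]

# Negative is far away: the margin is already satisfied, so the loss is 0.
easy = triplet_loss(anchor, same_class, [3.0, 0.0])

# Negative is as close as the positive: the loss equals the margin.
hard = triplet_loss(anchor, same_class, [1.0, 0.0])
```

Minimizing this quantity over many triples pulls same-class embeddings together and pushes different-class embeddings apart, which is the between-class versus within-class objective described above.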