Goal-conditioned Hierarchical Reinforcement Learning (HRL) is a promising approach for scaling up reinforcement learning (RL) techniques. However, it often suffers from training inefficiency because the action space of the high-level policy, i.e., the goal space, is large. Searching in a large goal space poses difficulty for both high-level subgoal generation and low-level policy learning. In this paper, we show that this problem can be effectively alleviated by restricting the high-level action space from the whole goal space to a $k$-step adjacent region of the current state using an adjacency constraint. We theoretically prove that in a deterministic Markov Decision Process (MDP) the proposed adjacency constraint preserves the optimal hierarchical policy, while in a stochastic MDP it induces a bounded state-value suboptimality determined by the MDP's transition structure. We further show that this constraint can be practically implemented by training an adjacency network that discriminates between adjacent and non-adjacent subgoals. Experimental results on discrete and continuous control tasks, including challenging simulated robot locomotion and manipulation tasks, show that incorporating the adjacency constraint significantly boosts the performance of state-of-the-art goal-conditioned HRL approaches.
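To make the adjacency idea concrete, the following is a minimal PyTorch sketch, not the paper's exact implementation, of an adjacency network: states are embedded so that a small embedding distance indicates that two states are within roughly $k$ environment steps of each other, and a hinge-style contrastive loss separates adjacent from non-adjacent pairs. All names and hyperparameters here (`AdjacencyNet`, `eps`, `margin`, the layer sizes) are illustrative assumptions.

```python
# Illustrative sketch of an adjacency network (assumed architecture and loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacencyNet(nn.Module):
    """Maps a state to an embedding; small embedding distance ~ k-step adjacency."""
    def __init__(self, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def adjacency_loss(phi_a: torch.Tensor, phi_b: torch.Tensor,
                   adjacent: torch.Tensor, eps: float = 1.0,
                   margin: float = 0.2) -> torch.Tensor:
    """Pull embeddings of k-step-adjacent state pairs within eps;
    push non-adjacent pairs beyond eps + margin."""
    d = torch.norm(phi_a - phi_b, dim=-1)
    pos = adjacent * F.relu(d - eps)                 # adjacent pairs: want d <= eps
    neg = (1 - adjacent) * F.relu(eps + margin - d)  # non-adjacent: want d >= eps + margin
    return (pos + neg).mean()

# Usage sketch: label state pairs sampled from trajectories as adjacent
# (reachable within k steps) or not, train the network, and use
# ||phi(s) - phi(g)|| <= eps as the adjacency test when proposing subgoals g.
net = AdjacencyNet(state_dim=4)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
s1, s2 = torch.randn(64, 4), torch.randn(64, 4)
labels = torch.randint(0, 2, (64,)).float()          # 1 = within k steps (dummy labels)
loss = adjacency_loss(net(s1), net(s2), labels)
opt.zero_grad(); loss.backward(); opt.step()
```

Under this sketch, the high-level policy would restrict or penalize subgoals whose embedding distance from the current state exceeds the threshold, which is one plausible way to realize the adjacency constraint described above.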