Recently, graph-based planning algorithms have gained much attention for solving goal-conditioned reinforcement learning (RL) tasks: they provide a sequence of subgoals for reaching the target goal, and the agent learns to execute subgoal-conditioned policies. However, the sample efficiency of such RL schemes remains a challenge, particularly for long-horizon tasks. To address this issue, we present a simple yet effective self-imitation scheme that distills a subgoal-conditioned policy into the target-goal-conditioned policy. Our intuition here is that to reach a target goal, an agent should pass through a subgoal, so target-goal- and subgoal-conditioned policies should be similar to each other. We also propose a novel scheme of stochastically skipping executed subgoals in a planned path, which further improves performance. Unlike prior methods that utilize graph-based planning only in the execution phase, our method transfers knowledge from a planner along with a graph into policy learning. We empirically show that our method can significantly boost the sample efficiency of existing goal-conditioned RL methods on various long-horizon control tasks.
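To make the self-imitation idea concrete, the sketch below illustrates one plausible form of the distillation step: along a planned subgoal path, the target-goal-conditioned policy is regressed toward the actions produced by the same policy when conditioned on the nearby subgoals. All names (GoalConditionedPolicy, self_imitation_loss) and the MSE distillation objective are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """Deterministic policy pi(s, g) -> a, conditioned on a goal vector."""
    def __init__(self, state_dim, goal_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))

def self_imitation_loss(policy, states, subgoals, target_goal):
    """Distill subgoal-conditioned behavior into the target-goal-conditioned policy.

    states:      (B, state_dim) states visited while following the planned path
    subgoals:    (B, goal_dim)  subgoal each state was conditioned on at execution time
    target_goal: (goal_dim,)    final goal of the planned path
    """
    with torch.no_grad():
        # Actions the agent takes toward the easier, nearby subgoals (teacher signal).
        subgoal_actions = policy(states, subgoals)
    # Push the target-goal-conditioned policy toward those same actions.
    goal = target_goal.expand(states.shape[0], -1)
    target_actions = policy(states, goal)
    return nn.functional.mse_loss(target_actions, subgoal_actions)

if __name__ == "__main__":
    policy = GoalConditionedPolicy(state_dim=4, goal_dim=2, action_dim=2)
    states = torch.randn(8, 4)
    subgoals = torch.randn(8, 2)      # e.g., subgoals returned by a graph-based planner
    target_goal = torch.randn(2)
    loss = self_imitation_loss(policy, states, subgoals, target_goal)
    loss.backward()
    print(float(loss))

In this reading, the planner is still used to pick the subgoals, but its output also shapes policy learning through the auxiliary imitation term, rather than being consumed only at execution time.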