Planning - the ability to analyze the structure of a problem in the large and decompose it into interrelated subproblems - is a hallmark of human intelligence. While deep reinforcement learning (RL) has shown great promise for solving relatively straightforward control tasks, how best to incorporate planning into existing deep RL paradigms to handle increasingly complex environments remains an open problem. One prominent framework, Model-Based RL, learns a world model and plans using step-by-step virtual rollouts. Such world models quickly diverge from reality as the planning horizon increases, and therefore struggle with long-horizon planning. How can we learn world models that endow agents with the ability to do temporally extended reasoning? In this work, we propose to learn graph-structured world models composed of sparse, multi-step transitions. We devise a novel algorithm that learns latent landmarks scattered (in terms of reachability) across the goal space as the nodes of the graph; the edges of this graph are reachability estimates distilled from Q-functions. On a variety of high-dimensional continuous control tasks ranging from robotic manipulation to navigation, we demonstrate that our method, named L3P, significantly outperforms prior work and is often the only method capable of leveraging both the robustness of model-free RL and the generalization of graph-search algorithms. We believe our work is an important step towards scalable planning in reinforcement learning.
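To make the graph construction concrete, below is a minimal sketch of how edge weights might be distilled from a goal-conditioned Q-function and then composed by graph search. It assumes a sparse-reward convention (-1 per step until the goal is reached) so that a Q-value implies a step count, and it uses hypothetical names: landmarks is a list of latent landmark embeddings and q_fn is the learned goal-conditioned Q-function. Neither the names nor the exact distillation formula are taken from the paper; this is an illustration of the idea, not the authors' implementation.

    import numpy as np

    def q_to_distance(q_value, gamma=0.99):
        # Assumed convention: with reward -1 per step until the goal is
        # reached, Q(s, g) = -(1 - gamma^d) / (1 - gamma), where d is the
        # number of steps; inverting this recovers d. The clip guards
        # against Q-values more negative than the convention allows.
        inner = np.clip(1.0 + (1.0 - gamma) * q_value, 1e-8, 1.0)
        return np.log(inner) / np.log(gamma)

    def build_landmark_graph(landmarks, q_fn, gamma=0.99, max_edge_dist=10.0):
        # Edges longer than max_edge_dist are pruned: long-horizon Q
        # estimates are unreliable, so the graph keeps only short,
        # trustworthy hops and recovers long-range reachability by
        # chaining them through intermediate landmarks.
        n = len(landmarks)
        dist = np.full((n, n), np.inf)
        np.fill_diagonal(dist, 0.0)
        for i in range(n):
            for j in range(n):
                if i != j:
                    d = q_to_distance(q_fn(landmarks[i], landmarks[j]), gamma)
                    if d <= max_edge_dist:
                        dist[i, j] = d
        # Floyd-Warshall: shortest multi-hop distances over the landmark graph.
        for k in range(n):
            dist = np.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
        return dist

Pruning long edges before running shortest-path search reflects the point made above: step-by-step estimates degrade over long horizons, so reliable long-range planning comes from composing short, well-estimated hops.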