Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve generalization? To answer these questions, we study the performance of MuZero (Schrittwieser et al., 2019), a state-of-the-art MBRL algorithm with strong connections to, and overlapping components with, many other MBRL algorithms. We perform a number of interventions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.
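To make finding (2) concrete, the sketch below illustrates what "a shallow tree with simple Monte-Carlo rollouts" means in the simplest possible terms: each candidate action is expanded once, and the resulting child state is evaluated by averaging random rollouts under the model. This is a minimal toy illustration, not MuZero's learned-model MCTS; the chain MDP, rollout depth, and rollout count are hypothetical choices made purely for exposition.

```python
# Minimal sketch of depth-1 ("shallow tree") planning with Monte-Carlo rollout
# evaluation. NOT MuZero's actual search; the toy chain MDP and all constants
# below are hypothetical, chosen only to illustrate the idea.
import random

N_STATES, N_ACTIONS, GAMMA = 10, 2, 0.95

def step(state, action):
    """Toy deterministic chain MDP: action 1 moves right, action 0 moves left.
    Reward 1.0 is given only for reaching the rightmost state."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

def rollout_return(state, depth=10):
    """Estimate a state's value with one random Monte-Carlo rollout."""
    total, discount = 0.0, 1.0
    for _ in range(depth):
        state, r = step(state, random.randrange(N_ACTIONS))
        total += discount * r
        discount *= GAMMA
    return total

def shallow_plan(state, n_rollouts=32):
    """Depth-1 tree: expand each action once, score its child by averaged
    Monte-Carlo rollouts, and act greedily on the resulting estimates."""
    scores = []
    for a in range(N_ACTIONS):
        child, r = step(state, a)
        value = sum(rollout_return(child) for _ in range(n_rollouts)) / n_rollouts
        scores.append(r + GAMMA * value)
    return max(range(N_ACTIONS), key=lambda a: scores[a])

if __name__ == "__main__":
    random.seed(0)
    state, total_reward = 0, 0.0
    for _ in range(20):
        state, r = step(state, shallow_plan(state))
        total_reward += r
    print("return over 20 steps:", total_reward)
```

The design point this illustrates is that the "tree" part of the planner can be very shallow (here a single expansion per action) as long as leaf evaluation provides a reasonable value estimate; in the paper's experiments, such simple schemes match deeper, more complex search in all but the hardest reasoning tasks.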