Local Search for Policy Iteration in Continuous Control

Jost Tobias Springenberg, Nicolas Heess, Daniel Mankowitz, Josh Merel, Arunkumar Byravan, Abbas Abdolmaleki, Jackie Kay, Jonas Degrave, Julian Schrittwieser, Yuval Tassa, Jonas Buchli, Dan Belov, Martin Riedmiller

We present an algorithm for local, regularized, policy improvement in reinforcement learning (RL) that allows us to formulate model-based and model-free variants in a single framework. Our algorithm can be interpreted as a natural extension of work on KL-regularized RL and introduces a form of tree search for continuous action spaces. We demonstrate that additional computation spent on model-based policy improvement during learning can improve data efficiency, and confirm that model-based policy improvement during action selection can also be beneficial. Quantitatively, our algorithm improves data efficiency on several continuous control benchmarks (when a model is learned in parallel), and it provides significant improvements in wall-clock time in high-dimensional domains (when a ground truth model is available). The unified framework also helps us to better understand the space of model-based and model-free algorithms. In particular, we demonstrate that some benefits attributed to model-based RL can be obtained without a model, simply by utilizing more computation.
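To make the idea of local, KL-regularized policy improvement concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a Gaussian prior policy at a single state, samples candidate actions from it, and reweights them by exp(Q/temperature), which is the standard sample-based solution of a KL-regularized improvement step. The names `improved_action_weights`, `local_policy_improvement`, `q_fn`, and `temperature` are hypothetical, and the quadratic `toy_q` stands in for a learned or model-based action-value estimate.

```python
import numpy as np


def improved_action_weights(q_values, temperature):
    """Softmax weights proportional to exp(Q / temperature) over sampled actions."""
    q = np.asarray(q_values, dtype=float)
    logits = (q - q.max()) / temperature  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def local_policy_improvement(prior_mean, prior_std, q_fn,
                             num_samples=256, temperature=0.5, seed=0):
    """One sample-based improvement step around a Gaussian prior policy.

    Assumption: q_fn(action) returns a scalar value estimate for the current
    state; in a model-based variant it could come from short model rollouts.
    """
    prior_mean = np.asarray(prior_mean, dtype=float)
    prior_std = np.asarray(prior_std, dtype=float)
    rng = np.random.default_rng(seed)
    # Candidate actions drawn from the prior policy (the "local" search region).
    actions = rng.normal(prior_mean, prior_std,
                         size=(num_samples, prior_mean.shape[0]))
    q_values = np.array([q_fn(a) for a in actions])
    weights = improved_action_weights(q_values, temperature)
    # Mean of the reweighted samples: a simple summary of the improved policy.
    improved_mean = weights @ actions
    return actions, weights, improved_mean


# Toy example: the value is highest at a hypothetical target action.
target = np.array([1.0, 1.0])


def toy_q(action):
    return -np.sum((action - target) ** 2)


actions, weights, improved_mean = local_policy_improvement(
    prior_mean=[0.0, 0.0], prior_std=[1.0, 1.0], q_fn=toy_q)
print("improved mean action:", improved_mean)
```

A smaller `temperature` concentrates the weights on the highest-valued samples (a greedier step), while a larger one keeps the improved policy closer to the prior. Drawing more samples per state is one simple way to spend additional computation on policy improvement without requiring a model.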
