We present an algorithm for local, regularized, policy improvement in reinforcement learning (RL) that allows us to formulate model-based and model-free variants in a single framework. Our algorithm can be interpreted as a natural extension of work on KL-regularized RL and introduces a form of tree search for continuous action spaces. We demonstrate that additional computation spent on model-based policy improvement during learning can improve data efficiency, and confirm that model-based policy improvement during action selection can also be beneficial. Quantitatively, our algorithm improves data efficiency on several continuous control benchmarks (when a model is learned in parallel), and it provides significant improvements in wall-clock time in high-dimensional domains (when a ground truth model is available). The unified framework also helps us to better understand the space of model-based and model-free algorithms. In particular, we demonstrate that some benefits attributed to model-based RL can be obtained without a model, simply by utilizing more computation.
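For reference, a generic form of the KL-regularized policy improvement step that this line of work builds on (a standard formulation, not necessarily the exact objective used by the algorithm presented here; $\alpha$ denotes a temperature controlling the strength of regularization and $Q^{\pi_k}$ the action-value function of the current policy $\pi_k$) is
\[
\pi_{k+1} = \arg\max_{\pi}\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi_k}(s,a) \right] \;-\; \alpha\, \mathrm{KL}\!\left( \pi(\cdot \mid s)\,\middle\|\,\pi_k(\cdot \mid s) \right),
\]
whose per-state solution takes the familiar exponentiated form
\[
\pi_{k+1}(a \mid s) \;\propto\; \pi_k(a \mid s)\,\exp\!\big( Q^{\pi_k}(s,a)/\alpha \big).
\]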