Designing effective model-based reinforcement learning algorithms is difficult because the ease of data generation must be weighed against the bias of model-generated data. In this paper, we study the role of model usage in policy optimization both theoretically and empirically. We first formulate and analyze a model-based reinforcement learning algorithm with a guarantee of monotonic improvement at each step. In practice, this analysis is overly pessimistic and suggests that real off-policy data is always preferable to model-generated on-policy data, but we show that an empirical estimate of model generalization can be incorporated into such analysis to justify model usage. Motivated by this analysis, we then demonstrate that a simple procedure of using short model-generated rollouts branched from real data has the benefits of more complicated model-based algorithms without the usual pitfalls. In particular, this approach surpasses the sample efficiency of prior model-based methods, matches the asymptotic performance of the best model-free algorithms, and scales to horizons that cause other model-based methods to fail entirely.
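To make the branching procedure concrete, the following is a minimal Python sketch of short model-generated rollouts branched from real data. The interfaces used here (`env_buffer.sample_states`, `model.step`, `policy.act`, `model_buffer.add`) are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
def branched_rollouts(env_buffer, model, policy, model_buffer,
                      num_rollouts=400, rollout_length=1):
    """Generate short model rollouts branched from real states.

    Each rollout starts from a state sampled from the real-data replay
    buffer and is unrolled for only a few steps under the learned
    dynamics model, limiting how far model error can compound.
    """
    # Branch points: states previously visited under the true dynamics.
    start_states = env_buffer.sample_states(num_rollouts)

    for s in start_states:
        for _ in range(rollout_length):
            a = policy.act(s)                   # action from the current policy
            s_next, r, done = model.step(s, a)  # transition predicted by the model
            model_buffer.add(s, a, r, s_next, done)
            if done:
                break
            s = s_next
```

Under this sketch, the policy would then be trained with an off-policy model-free learner on samples drawn from `model_buffer` (optionally mixed with real data), with `rollout_length` kept small so that model bias does not accumulate over long horizons.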