Model-based reinforcement learning (RL) is considered to be a promising approach to reduce the sample complexity that hinders model-free RL. However, the theoretical understanding of such methods has been rather limited. This paper introduces a novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees. We design a meta-algorithm with a theoretical guarantee of monotone improvement to a local maximum of the expected reward. The meta-algorithm iteratively builds a lower bound of the expected reward based on the estimated dynamical model and the sampled trajectories, and then maximizes the lower bound jointly over the policy and the model. The framework extends the optimism-in-the-face-of-uncertainty principle to non-linear dynamical models in a way that requires \textit{no explicit} uncertainty quantification. Instantiating our framework with simplifications yields a model-based RL algorithm, Stochastic Lower Bounds Optimization (SLBO). Experiments demonstrate that SLBO achieves state-of-the-art performance when only one million or fewer samples are permitted on a range of continuous control benchmark tasks.
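To make the alternation concrete, the following Python sketch shows one way the meta-algorithm's outer loop could be organized. It is an illustration only, under the assumption that trajectory collection and the joint lower-bound ascent step are supplied as callables; the names \texttt{collect\_trajectories}, \texttt{lower\_bound\_step}, \texttt{n\_outer}, and \texttt{n\_inner} are hypothetical placeholders, not identifiers from the paper.

\begin{verbatim}
from typing import Any, Callable, List, Tuple

# Hypothetical callables standing in for components the abstract leaves unspecified:
#   collect_trajectories(policy) -> list of transitions gathered in the real environment
#   lower_bound_step(policy, model, data) -> (policy, model) after one joint ascent
#       step on a lower bound of the expected reward built from model and data

def meta_algorithm(
    policy: Any,
    model: Any,
    collect_trajectories: Callable[[Any], List[Tuple]],
    lower_bound_step: Callable[[Any, Any, List[Tuple]], Tuple[Any, Any]],
    n_outer: int = 10,
    n_inner: int = 50,
) -> Any:
    """Sketch of the alternating loop described above (not the paper's code)."""
    data: List[Tuple] = []
    for _ in range(n_outer):
        # (1) Roll out the current policy in the real environment.
        data.extend(collect_trajectories(policy))
        # (2) Jointly maximize the lower bound over the policy and the model.
        #     Because the objective lower-bounds the true expected reward,
        #     increasing it cannot decrease the policy's real performance,
        #     which is the source of the monotone-improvement guarantee.
        for _ in range(n_inner):
            policy, model = lower_bound_step(policy, model, data)
    return policy
\end{verbatim}

Passing the data-collection and optimization steps in as callables keeps the sketch agnostic to the model class, the policy class, and the particular lower-bound construction, which is exactly the level of generality the meta-algorithm is stated at.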