Model-Based Reinforcement Learning (RL) is widely believed to have the potential to improve sample efficiency by allowing an agent to synthesize large amounts of imagined experience. Experience Replay (ER) can be considered a simple kind of model, which has proved extremely effective at improving the stability and efficiency of deep RL. In principle, a learned parametric model could improve on ER by generalizing from real experience to augment the dataset with additional plausible experience. However, owing to the many design choices involved in empirically successful algorithms, it can be very hard to establish where the benefits are actually coming from. Here, we provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful. First, we provide a general theorem motivating how learning a model as an intermediate step can narrow down the set of possible value functions more than learning a value function directly from data using the Bellman equation. Second, we provide an illustrative example showing empirically how a similar effect occurs in a more concrete setting with neural network function approximation. Finally, we provide extensive experiments showing the benefit of model-based learning for online RL in environments with combinatorial complexity, but factored structure that allows a learned model to generalize. In these experiments, we take care to control for other factors in order to isolate, insofar as possible, the benefit of using experience generated by a learned model relative to ER alone.
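To make the set-narrowing claim concrete, here is a rough sketch in notation we introduce for illustration only; it is not the formal statement of the theorem, and it assumes a deterministic environment for simplicity. Given a dataset $D$ of transitions $(s, a, r, s')$ and discount $\gamma$, fitting a value function directly accepts anything consistent with the sampled Bellman backups, whereas fitting a model from a restricted class $\mathcal{M}$ first accepts only value functions induced by some model that explains $D$:

$$
\mathcal{Q}_{\text{direct}}(D) \;=\; \bigl\{\, Q \;:\; Q(s,a) = r + \gamma \max_{a'} Q(s', a') \ \ \forall\, (s,a,r,s') \in D \,\bigr\},
$$
$$
\mathcal{Q}_{\text{model}}(D) \;=\; \bigl\{\, Q_{\hat m} \;:\; \hat m \in \mathcal{M} \text{ and } \hat m \text{ reproduces every transition in } D \,\bigr\},
$$

where $Q_{\hat m}$ denotes the optimal value function under model $\hat m$. Under these simplifying assumptions every element of the second set also lies in the first, but the converse need not hold: a model $\hat m$ that fits $D$ also pins down backups at state-action pairs outside $D$, so $\mathcal{Q}_{\text{model}}(D)$ can be a strict subset of $\mathcal{Q}_{\text{direct}}(D)$. This is the sense in which learning a model as an intermediate step can narrow down the set of value functions compatible with the data.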