Model-based reinforcement learning uses a model to plan, so that an agent's predictions and policies can be improved with more computation rather than additional data from the environment, which improves sample efficiency. However, learning an accurate model is hard. A natural question, then, is whether model-free methods can obtain benefits similar to those of planning. Experience replay is an essential component of many model-free algorithms: it enables sample-efficient and stable learning by storing past experiences for reuse in gradient computation. Prior work has established connections between models and experience replay by planning with the latter, which amounts to increasing the number of times a mini-batch is sampled and used for updates at each step (the amount of replay per step). We exploit this connection through a systematic study of the effect of varying the amount of replay per step in a well-known model-free algorithm, Deep Q-Network (DQN), in the Mountain Car environment. We show empirically that increasing replay improves DQN's sample efficiency, reduces the variation in its performance, and makes it more robust to changes in hyperparameters. Altogether, this takes a step toward a better algorithm for deployment.
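The idea of "replay per step" can be made concrete with a small sketch. The code below is not the DQN setup studied here: as a simplifying assumption it uses tabular Q-learning on a hypothetical five-state chain instead of a neural network on Mountain Car, and every name and constant in it (`q_learning_with_replay`, `replay_per_step`, the buffer capacity, the step sizes) is illustrative. It only shows where the extra updates enter the loop: one environment step is taken, then `replay_per_step` mini-batches are sampled from the buffer and used for updates.

```python
import random


def q_learning_with_replay(replay_per_step, num_steps=2000, seed=0):
    """Tabular Q-learning with experience replay on a toy 5-state chain.

    Illustrative stand-in for DQN: the agent starts in state 0 and receives
    reward 1 on reaching state 4, which ends the episode.
    """
    rng = random.Random(seed)
    n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
    gamma, alpha, epsilon = 0.9, 0.1, 0.1  # assumed, untuned constants
    batch_size, capacity = 8, 500
    q = [[0.0] * n_actions for _ in range(n_states)]
    buffer = []
    num_updates = 0
    state = 0

    for _ in range(num_steps):
        # Epsilon-greedy action selection with random tie-breaking.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            best = max(q[state])
            action = rng.choice(
                [a for a in range(n_actions) if q[state][a] == best]
            )

        # One real environment step, stored in the replay buffer.
        if action == 1:
            next_state = min(state + 1, n_states - 1)
        else:
            next_state = max(state - 1, 0)
        done = next_state == n_states - 1
        reward = 1.0 if done else 0.0
        buffer.append((state, action, reward, next_state, done))
        if len(buffer) > capacity:
            buffer.pop(0)                  # drop the oldest transition
        state = 0 if done else next_state

        # Replay: sample and apply `replay_per_step` mini-batches per step.
        for _ in range(replay_per_step):
            batch = rng.sample(buffer, min(batch_size, len(buffer)))
            for s, a, r, s2, d in batch:
                target = r if d else r + gamma * max(q[s2])
                q[s][a] += alpha * (target - q[s][a])
            num_updates += 1

    return q, num_updates


if __name__ == "__main__":
    for k in (1, 4, 16):
        q_values, updates = q_learning_with_replay(replay_per_step=k)
        print(k, updates, [round(max(row), 2) for row in q_values])
```

With the number of environment steps held fixed, raising `replay_per_step` increases only the computation spent on stored data, which is the sense in which replay acts like planning without a learned model.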