A commonly used heuristic in RL is experience replay (e.g.~\citealp{lin1993reinforcement, mnih2015human}), in which a learner stores and re-uses past trajectories as if they were sampled online. In this work, we initiate a rigorous study of this heuristic in the setting of tabular Q-learning. We provide a convergence rate guarantee and discuss how it compares to the convergence rate of standard Q-learning, depending on key parameters such as the frequency and number of replay iterations. By introducing and analyzing a simple class of MDPs, we also give theoretical evidence for when this heuristic can be expected to strictly improve performance. Finally, we present experiments supporting our theoretical findings.
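For concreteness, here is a minimal sketch of the update being replayed (the notation is illustrative and not taken from the abstract): given a stored transition $(s,a,r,s')$, discount factor $\gamma$, and step size $\eta \in (0,1]$, tabular Q-learning performs
\[
Q(s,a) \;\leftarrow\; (1-\eta)\,Q(s,a) + \eta\Bigl(r + \gamma \max_{a'} Q(s',a')\Bigr),
\]
and experience replay re-applies this same update to transitions drawn from a stored buffer, as if they had been sampled online.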