Experience replay methods are an essential component of reinforcement learning (RL) algorithms, designed to mitigate spurious correlations and biases while learning from temporally dependent data. Roughly speaking, these methods allow us to draw mini-batches from a large buffer so that temporal correlations do not hinder the performance of gradient-based descent algorithms. In this experimental work, we consider the recently developed and theoretically rigorous reverse experience replay (RER), which has been shown to remove such spurious biases in simplified theoretical settings. We combine RER with optimistic experience replay (OER) to obtain RER++, which is stable under neural function approximation. We show experimentally that RER++ outperforms techniques such as prioritized experience replay (PER) on various tasks, with significantly lower computational complexity. It is well known in the RL literature that greedily choosing the examples with the largest TD error (as in OER) or forming mini-batches from consecutive data points (as in RER) leads to poor performance when used in isolation. However, our method, which combines these techniques, works very well.
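To make the high-level description concrete, the following is a minimal Python sketch of how such a combination of OER-style prioritization with RER-style reversed replay might look: segments of consecutive transitions are scored by their accumulated TD error and then returned in reverse temporal order. The class name `ReverseReplayBuffer`, the segment-scoring rule, and all parameters are illustrative assumptions, not the paper's exact RER++ implementation.

```python
from collections import deque

import numpy as np


class ReverseReplayBuffer:
    """Sketch of a replay buffer combining OER-style prioritization with
    RER-style reversed consecutive mini-batches.

    Hypothetical interface; the segment selection and priority handling
    here are assumptions, not the paper's RER++ algorithm.
    """

    def __init__(self, capacity, segment_len):
        self.segment_len = segment_len
        self.transitions = deque(maxlen=capacity)  # (s, a, r, s', done)
        self.priorities = deque(maxlen=capacity)   # per-transition |TD error|

    def add(self, transition, td_error=1.0):
        self.transitions.append(transition)
        self.priorities.append(abs(td_error) + 1e-6)

    def sample_segment(self):
        """Pick the start of a consecutive segment with probability
        proportional to its summed TD error (OER-like prioritization),
        then return the segment in reverse temporal order (RER)."""
        n = len(self.transitions)
        if n < self.segment_len:
            raise ValueError("not enough transitions in the buffer")
        pr = np.asarray(self.priorities, dtype=np.float64)
        # Summed priority of each candidate segment [i, i + segment_len).
        seg_scores = np.convolve(pr, np.ones(self.segment_len), mode="valid")
        probs = seg_scores / seg_scores.sum()
        start = np.random.choice(len(probs), p=probs)
        segment = [self.transitions[start + j] for j in range(self.segment_len)]
        return segment[::-1]  # most recent transition is processed first
```

In this sketch, reversing the sampled segment is what distinguishes it from ordinary sequential replay, while the TD-error weighting of segment starts plays the role of the OER component; a full implementation would also refresh the stored priorities after each gradient update.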