For reinforcement learning on complex stochastic systems, where many factors dynamically impact the output trajectories, it is desirable to effectively leverage the information from historical samples collected in previous iterations to accelerate policy optimization. Classical experience replay allows agents to remember by reusing historical observations. However, a uniform reuse strategy that treats all observations equally overlooks the relative importance of different samples. To overcome this limitation, we propose a general variance reduction based experience replay (VRER) framework that selectively reuses the most relevant samples to improve policy gradient estimation. This selective mechanism adaptively puts more weight on past samples that are more likely to have been generated by the current target distribution. Our theoretical and empirical studies show that the proposed VRER can accelerate the learning of the optimal policy and enhance the performance of state-of-the-art policy optimization approaches.
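To illustrate the selective-reuse idea described above, the following is a minimal sketch of weighting historical samples by their likelihood ratio under the current target distribution and discarding samples whose ratio is far from one. The function names, the `ratio_bound` threshold, and the simple thresholding rule are illustrative assumptions for this sketch, not the exact selection criterion of VRER.

```python
import numpy as np

def select_and_reweight(behavior_logps, target_logps, ratio_bound=2.0):
    """Keep historical samples whose likelihood ratio under the current
    (target) policy is close to one, and return importance weights for them.

    behavior_logps: log-likelihoods of past trajectories under the policies
        that generated them.
    target_logps:   log-likelihoods of the same trajectories under the
        current target policy.
    ratio_bound:    hypothetical selection threshold; samples with ratios
        outside [1/ratio_bound, ratio_bound] are discarded.
    """
    ratios = np.exp(target_logps - behavior_logps)          # likelihood ratios
    keep = (ratios >= 1.0 / ratio_bound) & (ratios <= ratio_bound)
    return keep, ratios[keep]

def reused_policy_gradient(grad_estimates, keep, weights):
    """Average the per-sample gradient estimates of the kept samples,
    importance-weighted toward the current target distribution."""
    return np.average(grad_estimates[keep], axis=0, weights=weights)

# Hypothetical usage with K stored samples and d-dimensional gradients.
rng = np.random.default_rng(0)
K, d = 100, 5
behavior_logps = rng.normal(-10.0, 1.0, K)
target_logps = behavior_logps + rng.normal(0.0, 0.5, K)
grads = rng.normal(size=(K, d))
keep, w = select_and_reweight(behavior_logps, target_logps)
g_hat = reused_policy_gradient(grads, keep, w)
```

The thresholding step stands in for the general principle: samples more likely to have come from the current target distribution receive more influence on the gradient estimate, while stale samples contribute little or nothing.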