We extend the idea underlying the success of green simulation assisted policy gradient (GS-PG) to partial historical trajectory reuse for infinite-horizon Markov decision processes (MDPs). The existing GS-PG method was designed to learn from complete episodes or process trajectories, which limits its applicability in low-data environments and online process control. In this paper, mixture likelihood ratio (MLR) based policy gradient estimation is used to leverage the information from historical state-decision transitions generated under different behavioral policies. We propose a variance reduction experience replay (VRER) approach that can intelligently select and reuse the most relevant transition observations, improve the accuracy of policy gradient estimation, and accelerate the learning of the optimal policy. We then create a process control strategy by incorporating VRER into state-of-the-art step-based policy optimization approaches, such as the actor-critic method and proximal policy optimization (PPO). The empirical study demonstrates that the proposed policy gradient methodology can significantly outperform existing policy optimization approaches.
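To make the MLR reweighting and selective reuse concrete, the following is a minimal Python sketch under simplifying assumptions: a one-dimensional linear-Gaussian policy, historical transitions collected under a handful of behavioral policies, and an illustrative likelihood-ratio threshold standing in for the VRER selection rule. The function names and the threshold are hypothetical and do not reproduce the paper's exact algorithm.

```python
import numpy as np

# Assumptions for this sketch: a one-dimensional linear-Gaussian policy
# a ~ N(theta * s, std^2); each historical transition is (s, a, advantage)
# and was collected under one of K behavioral policies with parameters
# behavior_thetas.

def log_pdf(a, mean, std):
    # Log density of N(mean, std^2) evaluated at a.
    return -0.5 * ((a - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))

def mixture_likelihood_ratios(transitions, theta, behavior_thetas, std=0.5):
    # MLR weight per transition: pi_theta(a|s) / [(1/K) * sum_k pi_theta_k(a|s)].
    ratios = []
    for s, a, _ in transitions:
        target = np.exp(log_pdf(a, theta * s, std))
        mixture = np.mean([np.exp(log_pdf(a, bk * s, std)) for bk in behavior_thetas])
        ratios.append(target / mixture)
    return np.asarray(ratios)

def select_relevant(transitions, ratios, max_ratio=2.0):
    # Hypothetical stand-in for the VRER selection rule: keep only transitions
    # whose likelihood ratios stay bounded, so reuse does not inflate variance.
    return [(t, r) for t, r in zip(transitions, ratios) if r <= max_ratio]

def mlr_policy_gradient(selected, theta, std=0.5):
    # Reweighted policy gradient estimate over the selected historical transitions.
    grad = 0.0
    for (s, a, advantage), r in selected:
        score = (a - theta * s) / std ** 2 * s   # d/dtheta log pi_theta(a|s)
        grad += r * advantage * score
    return grad / max(len(selected), 1)
```

A step-based optimizer such as actor-critic or PPO would recompute these weights at every policy update and combine the selected historical transitions with newly collected ones before taking the gradient step.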