In real-world decision making tasks, it is critical for data-driven reinforcement learning methods to be both stable and sample efficient. On-policy methods typically generate reliable policy improvement throughout training, while off-policy methods make more efficient use of data through sample reuse. In this work, we combine the theoretically supported stability benefits of on-policy algorithms with the sample efficiency of off-policy algorithms. We develop policy improvement guarantees that are suitable for the off-policy setting, and connect these bounds to the clipping mechanism used in Proximal Policy Optimization. This motivates an off-policy version of the popular algorithm that we call Generalized Proximal Policy Optimization with Sample Reuse. We demonstrate both theoretically and empirically that our algorithm delivers improved performance by effectively balancing the competing goals of stability and sample efficiency.
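As a reference point for the clipping mechanism mentioned above, here is a minimal sketch of the standard PPO clipped surrogate objective. This is the well-known on-policy form only; the off-policy generalization developed in this work modifies this objective, and the names `ratio`, `advantages`, and `clip_eps` are illustrative placeholders rather than identifiers from the paper.

```python
import numpy as np

def ppo_clip_objective(ratio, advantages, clip_eps=0.2):
    """Standard (on-policy) PPO clipped surrogate objective, to be maximized.

    ratio:      pi_new(a|s) / pi_old(a|s) for each sampled (s, a) pair.
    advantages: advantage estimates for the same samples.
    clip_eps:   clipping parameter epsilon (0.2 is a common default).
    """
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise minimum gives a pessimistic lower bound,
    # which discourages updates that push the ratio far from 1.
    return np.mean(np.minimum(unclipped, clipped))
```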