In this paper, we propose cautious policy programming (CPP), a novel value-based reinforcement learning (RL) algorithm that can ensure monotonic policy improvement during learning. Exploiting the nature of entropy-regularized RL, we derive a new entropy-regularization-aware lower bound on policy improvement that only requires estimating the expected policy advantage function. CPP leverages this lower bound as a criterion for adjusting the degree of a policy update to alleviate policy oscillation. Unlike similar algorithms, which are mostly theory-oriented, we also propose a novel interpolation scheme that makes CPP scale better to high-dimensional control problems. We demonstrate that the proposed algorithm can trade off performance and stability in both didactic classic control problems and challenging high-dimensional Atari games.
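For intuition only, the following is a minimal tabular sketch of the cautious-update idea described above, not the authors' implementation: the candidate policy is interpolated with the current one, and the mixing coefficient is chosen from an estimate of the expected policy advantage. The function names, the tabular setting, and the mapping `zeta_fn` from the advantage estimate to the mixing coefficient are illustrative assumptions.

```python
import numpy as np

def expected_policy_advantage(pi_new, pi_old, q_values, state_dist):
    """Estimate E_{s~d}[ sum_a (pi_new(a|s) - pi_old(a|s)) * Q(s, a) ],
    the expected policy advantage of pi_new over pi_old.

    pi_new, pi_old : (num_states, num_actions) policy tables
    q_values       : (num_states, num_actions) action-value estimates
    state_dist     : (num_states,) state visitation distribution
    """
    per_state_adv = np.sum((pi_new - pi_old) * q_values, axis=1)
    return float(np.dot(state_dist, per_state_adv))

def cautious_update(pi_old, pi_new, q_values, state_dist, zeta_fn):
    """Interpolate between the old and candidate policies; the mixing
    coefficient zeta is derived from the expected policy advantage,
    standing in for a lower bound on policy improvement (illustrative)."""
    adv = expected_policy_advantage(pi_new, pi_old, q_values, state_dist)
    zeta = np.clip(zeta_fn(adv), 0.0, 1.0)  # cautious step size in [0, 1]
    return (1.0 - zeta) * pi_old + zeta * pi_new
```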