Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API). While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL divergence to the previous policy. Popular practical algorithms such as TRPO, MPO, and VMPO instead replace the regularization with a constraint on the KL divergence between consecutive policies, arguing that this is easier to implement and tune. In this work, we study this implementation choice in more detail. We compare the use of KL divergence as a constraint versus as a regularizer, and point out several optimization issues with the widely used constrained approach. We show that the constrained algorithm is not guaranteed to converge even on simple problem instances where the constrained problem can be solved exactly, and in fact incurs linear expected regret. With an approximate implementation using softmax policies, we show that regularization can improve the optimization landscape of the original objective. We demonstrate these issues empirically on several bandit and RL environments.
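To make the comparison concrete, the two per-state policy updates contrasted above can be sketched as follows; this is a schematic, with $Q^{\pi_k}$, $\lambda$, and $\epsilon$ used as generic placeholders for the action-value estimate, regularization strength, and trust-region radius rather than as definitions taken from the paper:

\begin{align*}
\text{regularized:} \quad & \pi_{k+1}(\cdot \mid s) = \arg\max_{\pi}\; \mathbb{E}_{a \sim \pi}\!\left[Q^{\pi_k}(s,a)\right] - \lambda\, \mathrm{KL}\!\left(\pi(\cdot \mid s)\,\|\,\pi_k(\cdot \mid s)\right) \\
\text{constrained:} \quad & \pi_{k+1}(\cdot \mid s) = \arg\max_{\pi}\; \mathbb{E}_{a \sim \pi}\!\left[Q^{\pi_k}(s,a)\right] \quad \text{s.t.} \quad \mathrm{KL}\!\left(\pi(\cdot \mid s)\,\|\,\pi_k(\cdot \mid s)\right) \le \epsilon
\end{align*}

The regularized form penalizes deviation from the previous policy $\pi_k$ with strength $\lambda$, whereas the constrained form, as used in TRPO/MPO-style algorithms, caps that deviation at a radius $\epsilon$; the claims above concern the differing convergence and optimization behavior of these two formulations.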