Policy optimization, which learns the policy of interest by maximizing the value function via large-scale optimization techniques, lies at the heart of modern reinforcement learning (RL). In addition to value maximization, other practical considerations arise commonly as well, including the need to encourage exploration and to ensure certain structural properties of the learned policy due to safety, resource, and operational constraints. These considerations can often be accounted for by resorting to regularized RL, which augments the target value function with a structure-promoting regularization term. Focusing on an infinite-horizon discounted Markov decision process, this paper proposes a generalized policy mirror descent (GPMD) algorithm for solving regularized RL. As a generalization of policy mirror descent (Lan, 2021), the proposed algorithm accommodates a general class of convex regularizers as well as a broad family of Bregman divergences chosen in cognizance of the regularizer in use. We demonstrate that our algorithm converges linearly, in a dimension-free fashion, to the global solution over an entire range of learning rates, even when the regularizer lacks strong convexity and smoothness. In addition, this linear convergence feature is provably stable in the face of inexact policy evaluation and imperfect policy updates. Numerical experiments are provided to corroborate the applicability and appealing performance of GPMD.
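To make the style of update concrete, below is a minimal sketch of one common instantiation of such a regularized mirror-descent scheme (not the paper's full GPMD method): a small tabular MDP with a negative-entropy regularizer and the matching KL Bregman divergence, for which the per-state step admits a closed form. The MDP data, step size `eta`, and regularization weight `tau` are hypothetical, and exact policy evaluation is used in place of the inexact variants analyzed in the paper.

```python
import numpy as np

# Sketch of regularized policy mirror descent on a random tabular MDP,
# assuming a negative-entropy regularizer with a KL Bregman divergence.
rng = np.random.default_rng(0)
S, A, gamma, tau, eta = 5, 3, 0.9, 0.1, 1.0   # tau: regularization weight, eta: step size

P = rng.dirichlet(np.ones(S), size=(S, A))    # transition kernel P[s, a, s']
r = rng.uniform(size=(S, A))                  # reward r[s, a]

def evaluate(pi, iters=500):
    """Entropy-regularized Q-function via fixed-point iteration (exact evaluation)."""
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = (pi * (Q - tau * np.log(pi + 1e-12))).sum(axis=1)  # regularized value
        Q = r + gamma * P @ V
    return Q

pi = np.full((S, A), 1.0 / A)                 # uniform initial policy
for t in range(200):
    Q = evaluate(pi)
    # Closed-form KL mirror step: pi_new ∝ pi^(1/(1+eta*tau)) * exp(eta*Q/(1+eta*tau))
    logits = (np.log(pi + 1e-12) + eta * Q) / (1.0 + eta * tau)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)

print("greedy actions per state:", pi.argmax(axis=1))
```

Because the regularizer here is the negative entropy, the Bregman divergence reduces to the KL divergence and the argmin over the simplex can be written out explicitly; other convex regularizers covered by the general framework would not necessarily yield such a closed form.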