Policy optimization, which finds the desired policy by maximizing value functions via optimization techniques, lies at the heart of reinforcement learning (RL). In addition to value maximization, other practical considerations arise as well, including the need to encourage exploration and to ensure certain structural properties of the learned policy due to safety, resource, and operational constraints. These can often be accounted for via regularized RL, which augments the target value function with a structure-promoting regularizer. Focusing on discounted infinite-horizon Markov decision processes, we propose a generalized policy mirror descent (GPMD) algorithm for solving regularized RL. As a generalization of policy mirror descent (arXiv:2102.00135), our algorithm accommodates a general class of convex regularizers and promotes the use of Bregman divergences in cognizance of the regularizer in use. We demonstrate that our algorithm converges linearly to the global solution over an entire range of learning rates, in a dimension-free fashion, even when the regularizer lacks strong convexity and smoothness. In addition, this linear convergence feature is provably stable in the face of inexact policy evaluation and imperfect policy updates. Numerical experiments are provided to corroborate the appealing performance of GPMD.
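To make the flavor of the update concrete, below is a minimal sketch of a GPMD-style iteration on a small tabular MDP, restricted to the special case where the regularizer is the negative entropy, so that the Bregman divergence generated by the regularizer reduces to the KL divergence and the per-state update admits a closed form. The general algorithm covers arbitrary convex regularizers and inexact policy evaluation, which this toy example does not attempt to reproduce; the MDP, the step size `eta`, the regularization weight `tau`, and all function names are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch: entropy-regularized special case of a GPMD-style update on a toy MDP.
import numpy as np

rng = np.random.default_rng(0)

# --- Toy tabular MDP (illustrative only) ---
S, A = 5, 3          # number of states / actions
gamma = 0.9          # discount factor
tau = 0.1            # regularization strength
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
r = rng.uniform(size=(S, A))                 # reward table

def evaluate(pi):
    """Exact evaluation of the entropy-regularized value function of policy pi."""
    # Per-state regularized reward: E_{a~pi}[ r(s,a) - tau * log pi(a|s) ]
    r_pi = np.einsum("sa,sa->s", pi, r - tau * np.log(pi))
    P_pi = np.einsum("sa,sap->sp", pi, P)    # state transition matrix under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * np.einsum("sap,p->sa", P, V)
    return V, Q

def gpmd_step(pi, Q, eta):
    """One mirror-descent step with the KL divergence (negative-entropy regularizer):
       pi_new(a|s) is proportional to pi(a|s)^{1/(1+eta*tau)} * exp(eta*Q(s,a)/(1+eta*tau))."""
    logits = (np.log(pi) + eta * Q) / (1.0 + eta * tau)
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum(axis=1, keepdims=True)

pi = np.full((S, A), 1.0 / A)    # uniform initial policy
eta = 1.0                        # learning rate; the paper's analysis covers a full range of step sizes
for t in range(50):
    V, Q = evaluate(pi)          # exact evaluation here; the paper also handles inexact evaluation
    pi_next = gpmd_step(pi, Q, eta)
    if t % 10 == 0:
        print(f"iter {t:3d}  mean V_tau = {V.mean():.6f}  "
              f"policy change = {np.abs(pi_next - pi).max():.2e}")
    pi = pi_next
```

In this entropy-regularized instance, choosing the Bregman divergence generated by the regularizer itself is what yields the simple multiplicative closed-form update above; for other convex regularizers the per-state subproblem generally has to be solved (approximately) as a small convex program.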