Policy gradient methods are among the most effective approaches for challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including whether and how fast they converge to a globally optimal solution, and how they cope with the approximation error incurred by using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on two settings: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and for which we provide agnostic learning results. A central contribution of this work is providing approximation guarantees that are average case, and thus avoid explicit worst-case dependencies on the size of the state space, by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).
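To make the tabular setting concrete, the following is a minimal sketch (not taken from the paper) of exact policy gradient ascent on the discounted objective with a tabular softmax parameterization. The toy MDP (P, R, rho), the step size eta, and the iteration count are illustrative assumptions chosen only for demonstration.

```python
# Minimal sketch: exact policy gradient ascent with a tabular softmax
# parameterization on a small, randomly generated discounted MDP.
# All problem data below are illustrative assumptions, not from the paper.
import numpy as np

S, A, gamma = 3, 2, 0.9                        # tiny illustrative MDP
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] = distribution over next states
R = rng.uniform(size=(S, A))                   # rewards r(s, a)
rho = np.ones(S) / S                           # start-state distribution

theta = np.zeros((S, A))                       # softmax parameters: pi(a|s) ∝ exp(theta[s, a])

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def value_and_visitation(pi):
    """Exact V^pi, Q^pi, and discounted state visitation d^pi_rho."""
    P_pi = np.einsum("sap,sa->sp", P, pi)      # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    return V, Q, d

eta = 1.0
for t in range(2000):                          # gradient ascent on V^pi(rho)
    pi = softmax_policy(theta)
    V, Q, d = value_and_visitation(pi)
    Adv = Q - V[:, None]                       # advantage function A^pi(s, a)
    # Softmax policy gradient: dV/dtheta[s, a] = d(s) * pi(a|s) * Adv(s, a) / (1 - gamma)
    grad = d[:, None] * pi * Adv / (1 - gamma)
    theta += eta * grad

V, _, _ = value_and_visitation(softmax_policy(theta))
print("V^pi(rho) after training:", rho @ V)
```

In this exact-gradient setting the softmax parameterization contains the optimal policy, which is the regime where the paper's global convergence results for tabular policy parameterizations apply; the log-linear and neural policy classes replace the per-state-action parameters theta[s, a] with a restricted parametric model, which is where the agnostic, average-case approximation guarantees become relevant.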