We consider infinite-horizon discounted Markov decision problems with finite state and action spaces. We show that with direct parametrization in the policy space, the weighted value function, although non-convex in general, is both quasi-convex and quasi-concave. While quasi-convexity helps explain the convergence of policy gradient methods to global optima, quasi-concavity hints at their convergence guarantees using arbitrarily large step sizes that are not dictated by the Lipschitz constant characterizing the smoothness of the value function. In particular, we show that when using geometrically increasing step sizes, a general class of policy mirror descent methods, including the natural policy gradient method and a projected Q-descent method, all enjoy a linear rate of convergence without relying on entropy or other strongly convex regularization. In addition, we develop a theory of weak gradient-mapping dominance and use it to prove a sharper sublinear convergence rate of the projected policy gradient method. Finally, we analyze the convergence rate of an inexact policy mirror descent method and estimate its sample complexity under a simple generative model.
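As an informal illustration of the kind of method described above, the following Python sketch runs tabular policy mirror descent with a KL Bregman divergence (which reduces to a natural-policy-gradient-style multiplicative update) and geometrically increasing step sizes on a generic finite MDP. The function name, argument shapes, and the specific schedule eta_k = eta0 / gamma**k are illustrative assumptions for exposition, not taken from the paper itself.

```python
import numpy as np

def policy_mirror_descent(P, r, gamma, num_iters=50, eta0=1.0):
    """Illustrative sketch: tabular policy mirror descent with KL divergence
    and geometrically increasing step sizes (an assumed schedule eta0 / gamma**k).

    P: transition tensor of shape (S, A, S); r: reward matrix of shape (S, A);
    gamma: discount factor in (0, 1).
    """
    S, A, _ = P.shape
    pi = np.full((S, A), 1.0 / A)                 # start from the uniform policy
    V = np.zeros(S)
    for k in range(num_iters):
        # Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi.
        P_pi = np.einsum('sa,sat->st', pi, P)
        r_pi = np.einsum('sa,sa->s', pi, r)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        Q = r + gamma * np.einsum('sat,t->sa', P, V)
        # KL mirror descent step: pi_{k+1}(a|s) proportional to
        # pi_k(a|s) * exp(eta_k * Q(s, a)), with a geometrically growing eta_k.
        eta_k = eta0 / gamma**k
        logits = np.log(pi) + eta_k * Q
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi, V
```

The projected Q-descent variant mentioned in the abstract would replace the multiplicative (softmax) update by a Euclidean projection of pi + eta_k * Q onto the probability simplex; the linear-rate analysis in the paper covers both choices of Bregman divergence.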