We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class. Using the compatible function approximation framework, both methods with log-linear policies can be written as inexact versions of the policy mirror descent (PMD) method. We show that both methods attain linear convergence rates and $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexities using a simple, non-adaptive geometrically increasing step size, without resorting to entropy or other strongly convex regularization. Lastly, as a byproduct, we obtain sublinear convergence rates for both methods with arbitrary constant step size.
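For concreteness, a schematic form of this inexact PMD interpretation (the symbols $\eta_k$, $\gamma$, and $\widehat{Q}_k$ below are illustrative and not fixed by the abstract) updates the policy multiplicatively,
\[
\pi_{k+1}(\cdot \mid s) \;\propto\; \pi_k(\cdot \mid s)\,\exp\!\bigl(\eta_k\,\widehat{Q}_k(s,\cdot)\bigr),
\qquad
\eta_{k+1} = \gamma\,\eta_k, \quad \gamma > 1,
\]
where $\widehat{Q}_k$ is an estimate of the action-value function of $\pi_k$ obtained through compatible function approximation, and the geometric growth of the step size $\eta_k$ is the simple, non-adaptive schedule referred to above.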