We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class. Within the compatible function approximation framework, both methods with log-linear policies can be written as approximate versions of the policy mirror descent (PMD) method. We show that both methods attain linear convergence rates and $\mathcal{O}(1/\epsilon^2)$ sample complexities using a simple, non-adaptive geometrically increasing step size, without resorting to entropy or other strongly convex regularization. Lastly, as a byproduct, we obtain sublinear convergence rates for both methods with an arbitrary constant step size.
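As a minimal illustration of the PMD viewpoint referenced above (the symbols $\eta_k$, $\lambda$, and $\hat{Q}^{\pi_k}$ are introduced here for exposition and are not taken verbatim from the paper), the update of NPG/Q-NPG with a log-linear policy $\pi_k$ can be sketched, at each state $s$, as a multiplicative-weights step driven by an estimated action-value function:
\[
\pi_{k+1}(\cdot \mid s) \;\propto\; \pi_k(\cdot \mid s)\,\exp\!\big(\eta_k\,\hat{Q}^{\pi_k}(s,\cdot)\big),
\qquad
\eta_{k+1} = \lambda\,\eta_k, \quad \lambda > 1,
\]
where $\hat{Q}^{\pi_k}$ is the compatible (log-linear) approximation of the true action-value function and the geometric growth of $\eta_k$ corresponds to the non-adaptive geometrically increasing step-size rule mentioned above; the exact update and the treatment of the approximation error are as defined in the paper.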