The classical algorithms used in tabular reinforcement learning (Value Iteration and Policy Iteration) have been shown to converge linearly with a rate given by the discount factor $\gamma$ of a discounted Markov Decision Process. Recently, there has been an increased interest in the study of gradient-based methods. In this work, we show that the dimension-free linear $\gamma$-rate of classical reinforcement learning algorithms can be achieved by a general family of unregularised Policy Mirror Descent (PMD) algorithms under an adaptive step-size. We also provide a matching worst-case lower bound, demonstrating that the $\gamma$-rate is optimal for PMD methods. Our work offers a novel perspective on the convergence of PMD. We avoid the use of the performance difference lemma beyond establishing the monotonic improvement of the iterates, which leads to a simple analysis that may be of independent interest. We also extend our analysis to the inexact setting and establish the first dimension-free $\varepsilon$-optimal sample complexity for unregularised PMD under a generative model, improving upon the best-known result.
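For context, a generic unregularised PMD iteration at each state $s$ takes the standard mirror-descent form; the notation below ($D_h$ for the Bregman divergence induced by a mirror map $h$, $\eta_k$ for the step-size, $Q^{\pi_k}$ for the action-value function of the current policy) is illustrative rather than quoted from the text:
\[
\pi_{k+1}(\cdot \mid s) \in \operatorname*{arg\,max}_{p \in \Delta(\mathcal{A})} \Big\{ \eta_k \big\langle Q^{\pi_k}(s,\cdot),\, p \big\rangle - D_h\big(p,\ \pi_k(\cdot \mid s)\big) \Big\}.
\]
The dimension-free linear $\gamma$-rate referred to above is a guarantee of the form $\|V^{\star} - V^{\pi_k}\|_{\infty} \le C\,\gamma^{k}$ for a problem-dependent constant $C$, matching the rate enjoyed by Value Iteration and Policy Iteration.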