The softmax policy gradient (PG) method, which performs gradient ascent under the softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $\mathcal{S}$ and the effective horizon $\frac{1}{1-\gamma}$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize $\eta$ can take \[ \frac{1}{\eta} |\mathcal{S}|^{2^{\Omega\big(\frac{1}{1-\gamma}\big)}} ~\text{iterations} \] to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.
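For concreteness, the iteration in question is the standard softmax PG update on a tabular MDP; the following display is a sketch of the usual formulation (the symbols $\theta$, $V^{\pi}(\rho)$, and $\rho$ follow common convention rather than notation fixed by the abstract itself): \[ \pi_{\theta}(a \mid s) \;=\; \frac{\exp\big(\theta(s,a)\big)}{\sum_{a' \in \mathcal{A}} \exp\big(\theta(s,a')\big)}, \qquad \theta^{(t+1)} \;=\; \theta^{(t)} + \eta \, \nabla_{\theta}\, V^{\pi_{\theta^{(t)}}}(\rho), \] where $V^{\pi}(\rho)$ denotes the $\gamma$-discounted value of policy $\pi$ under the initial state distribution $\rho$, and the gradient is assumed to be computed exactly, consistent with the exact-gradient setting considered above.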