Softmax Policy Gradient Methods Can Take Exponential Time to Converge

The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $\mathcal{S}$ and the effective horizon $\frac{1}{1-\gamma}$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, even when exact gradient computation is available. Specifically, we demonstrate that the softmax PG method with stepsize $\eta$ can take \[ \frac{1}{\eta} |\mathcal{S}|^{2^{\Omega\big(\frac{1}{1-\gamma}\big)}} ~\text{iterations} \] to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.
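To make the object of study concrete, the following is a minimal sketch of softmax PG with exact gradients on a tabular MDP. It is not the paper's hard instance: the two-state, two-action MDP used in the usage note and all function names (`softmax`, `evaluate`, `visitation`, `run_softmax_pg`) are illustrative assumptions. The gradient is the standard policy gradient expression under softmax parameterization, $\frac{\partial V^{\pi_\theta}(\mu)}{\partial \theta(s,a)} = \frac{1}{1-\gamma}\, d_\mu^{\pi_\theta}(s)\, \pi_\theta(a \mid s)\, A^{\pi_\theta}(s,a)$, and "exact" here means exact up to the truncation error of the fixed-point iterations.

```python
import math

def softmax(prefs):
    """Numerically stable softmax over a list of logits."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def evaluate(P, R, pi, gamma, sweeps=300):
    """Policy evaluation by fixed-point iteration (error shrinks like gamma**sweeps)."""
    n_s, n_a = len(R), len(R[0])
    V = [0.0] * n_s
    for _ in range(sweeps):
        V = [sum(pi[s][a] * (R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_s)))
                 for a in range(n_a)) for s in range(n_s)]
    Q = [[R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_s))
          for a in range(n_a)] for s in range(n_s)]
    return V, Q

def visitation(P, pi, mu, gamma, horizon=300):
    """Discounted visitation d(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s), with s_0 ~ mu."""
    n_s, n_a = len(pi), len(pi[0])
    occ, d = list(mu), [0.0] * n_s
    for t in range(horizon):
        d = [d[s] + (1 - gamma) * gamma ** t * occ[s] for s in range(n_s)]
        # Push the state distribution one step forward under the current policy.
        occ = [sum(occ[s] * pi[s][a] * P[s][a][nxt] for s in range(n_s) for a in range(n_a))
               for nxt in range(n_s)]
    return d

def run_softmax_pg(P, R, mu, gamma, eta, iterations):
    """Gradient ascent on V(mu) over softmax logits; returns V(mu) before each update."""
    n_s, n_a = len(R), len(R[0])
    theta = [[0.0] * n_a for _ in range(n_s)]  # all-zero logits = uniform initial policy
    values = []
    for _ in range(iterations + 1):
        pi = [softmax(row) for row in theta]
        V, Q = evaluate(P, R, pi, gamma)
        values.append(sum(mu[s] * V[s] for s in range(n_s)))
        d = visitation(P, pi, mu, gamma)
        for s in range(n_s):
            for a in range(n_a):
                # Policy gradient: d(s) * pi(a|s) * advantage(s, a) / (1 - gamma).
                grad = d[s] * pi[s][a] * (Q[s][a] - V[s]) / (1 - gamma)
                theta[s][a] += eta * grad
    return values
```

As a usage example under the same assumptions, take a hypothetical MDP in which action 0 always leads to state 0, action 1 always leads to state 1, and only taking action 1 in state 1 yields reward 1: `P = [[[1, 0], [0, 1]], [[1, 0], [0, 1]]]`, `R = [[0, 0], [0, 1]]`, `mu = [0.5, 0.5]`, `gamma = 0.9`. Calling `run_softmax_pg(P, R, mu, 0.9, 0.1, 100)` returns the sequence of values $V^{(t)}(\mu)$, which can be compared against the optimal value $9.5$ for this toy instance. On such a benign instance progress is quick; the lower bound above concerns a far more adversarial construction.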
