The softmax policy gradient (PG) method, which performs gradient ascent under the softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $\mathcal{S}$ and the effective horizon $\frac{1}{1-\gamma}$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration. This is accomplished by characterizing the algorithmic dynamics over a carefully-constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.
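For concreteness, the softmax PG method referred to above can be sketched in standard notation (a minimal sketch; the stepsize $\eta$ and the initial state distribution $\rho$ are the usual generic symbols, not quantities fixed by this abstract). The policy is parameterized as
\[
\pi_\theta(a \mid s) = \frac{\exp\big(\theta(s,a)\big)}{\sum_{a' \in \mathcal{A}} \exp\big(\theta(s,a')\big)}, \qquad (s,a) \in \mathcal{S} \times \mathcal{A},
\]
and, given exact gradient access, the iterates follow gradient ascent on the value function:
\[
\theta^{(t+1)} = \theta^{(t)} + \eta \, \nabla_\theta V^{\pi_{\theta^{(t)}}}(\rho),
\qquad
V^{\pi}(\rho) = \mathbb{E}_{s_0 \sim \rho}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\Big],
\]
where $V^{\pi}(\rho)$ denotes the $\gamma$-discounted value of policy $\pi$ under initial state distribution $\rho$.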