We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at an $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a \L{}ojasiewicz inequality, and that the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy-regularized policy gradient and show that it enjoys a significantly faster linear convergence rate of $O(e^{-c \cdot t})$ (with $c > 0$) toward the softmax optimal policy. This result resolves an open question in the recent literature. Finally, combining the above two results with additional new $\Omega(1/t)$ lower-bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform \L{}ojasiewicz degree. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies.
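As a rough illustration of the separation in rates, the following sketch runs exact-gradient softmax policy gradient, with and without entropy regularization, on a hypothetical three-armed bandit (a single-state MDP). The rewards, step size $\eta$, and temperature $\tau$ below are arbitrary choices for illustration, not values prescribed by the analysis.

\begin{verbatim}
import numpy as np

def softmax(theta):
    z = theta - theta.max()          # numerically stabilized softmax
    e = np.exp(z)
    return e / e.sum()

# Illustrative 3-armed bandit; rewards, step size, and temperature are
# arbitrary choices for this sketch.
r = np.array([1.0, 0.8, 0.2])
eta, tau, T = 0.4, 0.1, 5000

theta_pg = np.zeros(3)               # vanilla softmax policy gradient
theta_ent = np.zeros(3)              # entropy-regularized policy gradient

for t in range(T):
    # True gradient of the expected reward: pi_a * (r_a - pi^T r)
    pi = softmax(theta_pg)
    theta_pg += eta * pi * (r - pi @ r)

    # True gradient of the entropy-regularized objective: the same formula
    # applied to the "regularized reward" r - tau * log(pi)
    pi = softmax(theta_ent)
    r_tilde = r - tau * np.log(pi)
    theta_ent += eta * pi * (r_tilde - pi @ r_tilde)

def reg_value(pi):
    # Entropy-regularized objective: pi^T r + tau * H(pi)
    return pi @ r + tau * (-(pi * np.log(pi)).sum())

# Optimal regularized value is attained by softmax(r / tau).
opt_reg = tau * np.log(np.exp(r / tau).sum())

print("vanilla sub-optimality:    ", r.max() - softmax(theta_pg) @ r)
print("regularized sub-optimality:", opt_reg - reg_value(softmax(theta_ent)))
\end{verbatim}

Under these (illustrative) settings, the unregularized run closes its gap roughly like $1/t$, while the regularized run approaches its own (softmax-optimal) solution at a visibly faster, geometric pace, consistent with the $O(1/t)$ versus $O(e^{-c \cdot t})$ separation described above.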