The policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems. However, a solid theoretical understanding of even the "vanilla" PG has remained elusive for a long time. In this paper, we apply recent tools developed for the analysis of SGD in non-convex optimization to obtain convergence guarantees for both REINFORCE and GPOMDP under a smoothness assumption on the objective function and weak conditions on the second moment of the norm of the estimated gradient. When instantiated under common assumptions on the policy space, our general result immediately recovers existing $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity guarantees, but for wider ranges of parameters (e.g., step size and batch size $m$) than in the previous literature. Notably, our result includes the single trajectory case (i.e., $m=1$), and it provides a more accurate analysis of the dependence on problem-specific parameters, correcting previous results available in the literature. We believe that the integration of state-of-the-art tools from non-convex optimization may lead to identifying a much broader range of problems where PG methods enjoy strong theoretical guarantees.
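For context, a minimal sketch of the vanilla PG update referred to above, written in standard notation not fixed by the abstract ($\eta$ denotes the step size, $m$ the batch size, $\tau_i$ sampled trajectories of horizon $H$, $\gamma$ the discount factor, and $\hat{\nabla}_\theta J$ the REINFORCE-style estimator):
$$
\theta_{k+1} = \theta_k + \frac{\eta}{m} \sum_{i=1}^{m} \hat{\nabla}_\theta J(\tau_i; \theta_k),
\qquad
\hat{\nabla}_\theta J(\tau; \theta) = \left( \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) \left( \sum_{t=0}^{H-1} \gamma^t r(s_t, a_t) \right),
$$
so that the single trajectory case $m=1$ corresponds to updating from one sampled trajectory per iteration.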