We adapt recent tools developed for the analysis of Stochastic Gradient Descent (SGD) in non-convex optimization to obtain convergence guarantees and sample complexities for the vanilla policy gradient (PG) methods -- REINFORCE and GPOMDP. Our only assumptions are that the expected return is smooth w.r.t. the policy parameters and that the second moment of its gradient satisfies a certain \emph{ABC assumption}. The ABC assumption allows the second moment of the gradient to be bounded by $A\geq 0$ times the suboptimality gap, $B \geq 0$ times the norm of the full batch gradient, and an additive constant $C \geq 0$, or any combination of the aforementioned. We show that the ABC assumption is more general than the assumptions on the policy space commonly used to prove convergence to a stationary point. We provide a single convergence theorem under the ABC assumption, and show that, despite the generality of the ABC assumption, we recover the $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity of PG. Our convergence theorem also affords greater flexibility in the choice of hyperparameters such as the step size, and places no restriction on the batch size $m$. Even the single trajectory case (i.e., $m=1$) fits within our analysis. We believe that the generality of the ABC assumption may provide theoretical guarantees for PG in a much broader range of problems than previously considered.
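For concreteness, a sketch of the ABC condition described above, written in our notation (here $\widehat{\nabla} J(\theta)$ denotes the mini-batch gradient estimator, $J^* := \sup_\theta J(\theta)$ the optimal return, and the exact constants and normalization may differ in the formal statement):
$$\mathbb{E}\big[\,\|\widehat{\nabla} J(\theta)\|^2\,\big] \;\le\; 2A\big(J^* - J(\theta)\big) \;+\; B\,\|\nabla J(\theta)\|^2 \;+\; C, \qquad A, B, C \ge 0.$$
Setting $A = B = 0$ recovers the classical bounded-variance assumption, while nonzero $A$ or $B$ allows the estimator's second moment to grow with the suboptimality gap or the full-gradient norm.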