We study the effect of baselines in on-policy stochastic policy gradient optimization, and close the gap between the theory and practice of policy optimization methods. Our first contribution is to show that the \emph{state value} baseline allows on-policy stochastic \emph{natural} policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate, which was not previously known. The analysis relies on two novel findings: the expected progress of the NPG update satisfies a stochastic version of the non-uniform \L{}ojasiewicz (N\L{}) inequality, and with probability 1 the state value baseline prevents the optimal action's probability from vanishing, thus ensuring sufficient exploration. Importantly, these results provide a new understanding of the role of baselines in stochastic policy gradient: by showing that the variance of natural policy gradient estimates remains unbounded with or without a baseline, we find that variance reduction \emph{cannot} explain the utility of baselines in this setting. Instead, the analysis reveals that the primary effect of the value baseline is to \textbf{reduce the aggressiveness of the updates} rather than their variance. That is, we demonstrate that a finite variance is \emph{not necessary} for almost sure convergence of stochastic NPG, while controlling update aggressiveness is both necessary and sufficient. Experimental results verify these theoretical findings.
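For concreteness, the following is a minimal sketch of the kind of update under discussion, written for the single-state (bandit) softmax case with an importance-weighted one-sample estimator; the notation here is illustrative rather than the paper's exact formulation. With a softmax policy $\pi_\theta(a) \propto \exp(\theta(a))$, reward vector $r$, step size $\eta$, and a sampled action $a_t \sim \pi_{\theta_t}(\cdot)$, the stochastic NPG update with the state value baseline takes the form
\begin{equation*}
\theta_{t+1}(a) \;=\; \theta_t(a) \;+\; \eta \, \frac{\mathbf{1}\{a = a_t\}}{\pi_{\theta_t}(a_t)} \, \big( r(a_t) - \pi_{\theta_t}^{\top} r \big),
\end{equation*}
where $\pi_{\theta_t}^{\top} r$ is the state value under the current policy; without the baseline, the same update uses $r(a_t)$ alone. In either case the importance weight $1/\pi_{\theta_t}(a_t)$ can make the variance of the update unbounded, consistent with the discussion above.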