We study the effect of stochasticity in on-policy policy optimization, and make the following four contributions. First, we show that the preferability of optimization methods depends critically on whether stochastic or exact gradients are used. In particular, unlike in the true-gradient setting, geometric information cannot easily be exploited to accelerate policy optimization in the stochastic case without detrimental consequences or impractical assumptions. Second, to explain these findings, we introduce the concept of committal rate for stochastic policy optimization, and show that it can serve as a criterion for determining almost sure convergence to global optimality. Third, we show that in the absence of external oracle information that would allow an algorithm to distinguish optimal from sub-optimal actions given only on-policy samples, there is an inherent trade-off between exploiting geometry to accelerate convergence and achieving optimality almost surely. That is, an uninformed algorithm either converges to a globally optimal policy with probability $1$ but at a rate no better than $O(1/t)$, or it converges faster than $O(1/t)$ but must then fail to converge to the globally optimal policy with some positive probability. Finally, we use the committal rate theory to explain why practical policy optimization methods are sensitive to random initialization, and then develop an ensemble method that can be guaranteed to achieve near-optimal solutions with high probability.
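To make the trade-off concrete, the following is a minimal sketch under assumptions that go beyond the abstract. It assumes a softmax policy on a bandit with a fixed positive reward, an on-policy importance-sampling reward estimate, and a reading of committal rate as the speed at which $1 - \pi_{\theta_t}(a)$ shrinks when the same action $a$ is sampled at every step. The function names (`pg_step`, `npg_step`, `residual_mass`) and the step size are hypothetical and chosen only for illustration. The plain softmax policy gradient update and a natural-gradient-style update stand in for a method that ignores geometry and one that exploits it.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def pg_step(theta, a, reward, eta):
    """One softmax policy gradient step (illustrative assumption).

    Uses the importance-sampling estimate r_hat(a) = reward / pi(a) for the
    sampled action and 0 elsewhere, which gives the update
    theta(i) += eta * reward * (1{i == a} - pi(i)).
    """
    pi = softmax(theta)
    return [
        th + eta * reward * ((1.0 if i == a else 0.0) - pi[i])
        for i, th in enumerate(theta)
    ]


def npg_step(theta, a, reward, eta):
    """One natural-gradient-style step (illustrative assumption).

    Adds the importance-sampling estimate directly to the logits, so only
    the sampled action moves: theta(a) += eta * reward / pi(a).
    """
    pi = softmax(theta)
    new_theta = list(theta)
    new_theta[a] += eta * reward / pi[a]
    return new_theta


def residual_mass(step, a=0, reward=1.0, eta=1.0, num_actions=2, steps=1000):
    """Probability left on actions other than `a` after sampling `a` forever."""
    theta = [0.0] * num_actions
    for _ in range(steps):
        theta = step(theta, a, reward, eta)
    pi = softmax(theta)
    return sum(p for i, p in enumerate(pi) if i != a)


if __name__ == "__main__":
    for t in (10, 100, 1000):
        print(t, residual_mass(pg_step, steps=t), residual_mass(npg_step, steps=t))
```

In this sketch the residual mass under `pg_step` shrinks roughly like $1/t$, whereas under `npg_step` it collapses much faster. The faster update therefore commits to whichever action happens to be sampled repeatedly, including a sub-optimal one, which is the kind of behaviour the trade-off in the abstract describes.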