Despite their success, policy gradient methods suffer from high variance in their gradient estimates, which can result in unsatisfactory sample complexity. Recently, numerous variance-reduced extensions of policy gradient methods with provably better sample complexity and competitive numerical performance have been proposed. After a compact survey of some of the main variance-reduced REINFORCE-type methods, we propose ProbAbilistic Gradient Estimation for Policy Gradient (PAGE-PG), a novel loopless variance-reduced policy gradient method based on a probabilistic switch between two types of updates. Our method is inspired by the PAGE estimator for supervised learning and leverages importance sampling to obtain an unbiased gradient estimator. We show that PAGE-PG enjoys a $\mathcal{O}\left( \epsilon^{-3} \right)$ average sample complexity to reach an $\epsilon$-stationary solution, which matches the sample complexity of its most competitive counterparts under the same setting. A numerical evaluation confirms the competitive performance of our method on classical control tasks.
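To make the probabilistic switch concrete, the following is a minimal sketch of a PAGE-PG-style update rule, not the authors' exact algorithm: with some probability the gradient estimate is recomputed from scratch with a plain REINFORCE estimate on a large batch, and otherwise the previous estimate is corrected on a small batch using an importance-sampling-weighted gradient difference. The toy bandit environment, the softmax policy parameterization, and all hyperparameters (`p_switch`, `N`, `B`, `lr`) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.0, 1.0])          # toy 2-armed Gaussian bandit

def sample_batch(theta, n):
    """Sample n one-step 'trajectories' (action, reward) from a softmax policy."""
    probs = np.exp(theta) / np.exp(theta).sum()
    actions = rng.choice(2, size=n, p=probs)
    rewards = rng.normal(true_means[actions], 1.0)
    return actions, rewards

def log_prob_grad(theta, actions):
    """Gradient of log pi_theta(a) for a softmax policy, one row per sample."""
    probs = np.exp(theta) / np.exp(theta).sum()
    grads = -np.tile(probs, (len(actions), 1))
    grads[np.arange(len(actions)), actions] += 1.0
    return grads

def reinforce_grad(theta, actions, rewards, weights=None):
    """(Optionally importance-weighted) REINFORCE estimate: w * R * grad log pi(a)."""
    w = np.ones(len(actions)) if weights is None else weights
    return (w * rewards)[:, None] * log_prob_grad(theta, actions)

def importance_weights(theta_new, theta_old, actions):
    """Likelihood ratios pi_old(a) / pi_new(a) for samples drawn under theta_new."""
    p_new = np.exp(theta_new) / np.exp(theta_new).sum()
    p_old = np.exp(theta_old) / np.exp(theta_old).sum()
    return p_old[actions] / p_new[actions]

# Hypothetical hyperparameters, for illustration only.
p_switch, N, B, lr = 0.2, 512, 32, 0.05
theta = np.zeros(2)

a, r = sample_batch(theta, N)
g = reinforce_grad(theta, a, r).mean(axis=0)        # initial full-batch estimate

for t in range(200):
    theta_prev, theta = theta, theta + lr * g       # gradient-ascent policy update
    if rng.random() < p_switch:
        # Full refresh: plain REINFORCE estimate from a large batch.
        a, r = sample_batch(theta, N)
        g = reinforce_grad(theta, a, r).mean(axis=0)
    else:
        # Cheap recursive correction: keep the previous estimate and add the
        # importance-weighted gradient difference computed on a small batch.
        a, r = sample_batch(theta, B)
        w = importance_weights(theta, theta_prev, a)
        g = g + (reinforce_grad(theta, a, r)
                 - reinforce_grad(theta_prev, a, r, weights=w)).mean(axis=0)

print("learned action probabilities:", np.exp(theta) / np.exp(theta).sum())
```

The importance weights re-weight the gradient evaluated at the previous parameters so that, in expectation, the correction term is unbiased even though the small batch is sampled under the current policy; this mirrors the role importance sampling plays in the loopless estimator described above.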