Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the $\mathtt{VRMPO}$ algorithm: a sample-efficient policy gradient method based on stochastic mirror descent. In $\mathtt{VRMPO}$, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, which matches the best known sample complexity for policy optimization. Extensive experimental results demonstrate that $\mathtt{VRMPO}$ outperforms state-of-the-art policy gradient methods in various settings.
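For intuition, a schematic form of a mirror-descent policy update driven by a recursive variance-reduced gradient estimator (in the SARAH/SPIDER style) is sketched below; the mirror map $\psi$, Bregman divergence $D_\psi$, step size $\alpha$, batch size $B$, and importance weight $\omega$ are illustrative notation under our assumptions, not necessarily the paper's exact construction:
\begin{align*}
  % recursive variance-reduced policy gradient estimate (schematic)
  v_t &= \frac{1}{B}\sum_{i=1}^{B}\Big[\, g(\tau_i \mid \theta_t)
        - \omega(\tau_i;\theta_{t-1},\theta_t)\, g(\tau_i \mid \theta_{t-1}) \Big] + v_{t-1},\\
  % mirror-descent (ascent) step with Bregman divergence D_psi
  \theta_{t+1} &= \operatorname*{arg\,max}_{\theta}\;
        \Big\{ \langle v_t, \theta \rangle - \tfrac{1}{\alpha}\, D_{\psi}(\theta, \theta_t) \Big\},
\end{align*}
where $g(\tau \mid \theta)$ denotes a single-trajectory policy gradient estimate (e.g., REINFORCE/GPOMDP), $\omega$ is an importance weight correcting for the change of policy between iterates, and $D_{\psi}(\theta,\theta') = \psi(\theta) - \psi(\theta') - \langle \nabla\psi(\theta'), \theta - \theta' \rangle$ is the Bregman divergence induced by $\psi$. With $\psi(\theta) = \tfrac{1}{2}\|\theta\|_2^2$ the update reduces to a standard stochastic gradient ascent step with the variance-reduced estimator $v_t$.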