Variance-reduced gradient estimators for policy gradient methods have been a main focus of reinforcement learning research in recent years, as they accelerate the estimation process. We propose a variance-reduced policy-gradient method, called SHARP, which incorporates second-order information into stochastic gradient descent (SGD) using momentum with a time-varying learning rate. The SHARP algorithm is parameter-free, reaching an $\epsilon$-approximate first-order stationary point with $O(\epsilon^{-3})$ trajectories while using a batch size of $O(1)$ at each iteration. Unlike most previous work, our proposed algorithm does not require importance sampling, which can compromise the benefits of variance reduction. Moreover, the variance of the estimation error decays at the fast rate of $O(1/t^{2/3})$, where $t$ is the number of iterations. Our extensive experimental evaluations show the effectiveness of the proposed algorithm on various control tasks and its advantage over the state of the art in practice.
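To make the update structure concrete, the following is a minimal sketch of a variance-reduced momentum step that uses a second-order (Hessian-vector product) correction and a time-varying learning rate in place of importance weights. It is an illustrative assumption, not the paper's exact implementation: the estimator callables, schedule constants, and decay exponents are hypothetical placeholders.

```python
import numpy as np

def sharp_style_step(theta, theta_prev, d_prev, t,
                     grad_estimator, hvp_estimator,
                     eta0=1.0, lr0=0.1):
    """
    One iteration of a variance-reduced momentum policy-gradient update
    with a second-order correction and a time-varying learning rate.

    grad_estimator(theta) -> unbiased policy-gradient estimate from a
        single sampled trajectory (batch size O(1)).
    hvp_estimator(theta, v) -> estimate of the policy Hessian applied to
        the displacement v, sampled along the same trajectory.

    NOTE: eta0, lr0, and the decay exponents below are illustrative
    assumptions, not the paper's tuned values.
    """
    # Time-varying momentum weight and step size.
    eta_t = min(1.0, eta0 / (t + 1) ** (2.0 / 3.0))
    lr_t = lr0 / (t + 1) ** (1.0 / 3.0)

    # First-order term from one fresh trajectory.
    g_t = grad_estimator(theta)

    # Second-order correction along the parameter displacement; this
    # plays the role that importance weights play in earlier
    # variance-reduced estimators.
    correction = hvp_estimator(theta, theta - theta_prev)

    # Momentum recursion: d_t = g_t + (1 - eta_t) * (d_prev + H * (theta - theta_prev)).
    d_t = g_t + (1.0 - eta_t) * (d_prev + correction)

    # Normalized ascent step (maximizing expected return).
    theta_next = theta + lr_t * d_t / (np.linalg.norm(d_t) + 1e-8)
    return theta_next, d_t
```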