Variance-reduced gradient estimators for policy gradient methods have been a main focus of reinforcement learning research in recent years, as they accelerate the estimation process. We propose a variance-reduced policy gradient method, called SGDHess-PG, which incorporates second-order information into stochastic gradient descent (SGD) with momentum and an adaptive learning rate. The SGDHess-PG algorithm reaches an $\epsilon$-approximate first-order stationary point with $\tilde{O}(\epsilon^{-3})$ trajectories while using a batch size of $O(1)$ at each iteration. Unlike most previous work, our algorithm does not require importance sampling techniques, which can compromise the benefit of variance reduction. Extensive experimental results demonstrate the effectiveness of the proposed algorithm on various control tasks and its advantage over the state of the art in practice.
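The sketch below illustrates, at a high level, how second-order information can enter a momentum-based, variance-reduced update without importance weights: the recursive gradient estimate is corrected with a stochastic Hessian-vector product computed on the current sample. This is a minimal toy illustration under our own assumptions (a quadratic surrogate loss, the hyper-parameter values, and a norm-based adaptive step size are all illustrative), not the authors' implementation of SGDHess-PG.

```python
# Minimal sketch: recursive (STORM-style) momentum whose correction term is a
# Hessian-vector product on the current sample, so no importance sampling is needed.
# The toy "trajectory loss", beta, eta, and the adaptive step rule are assumptions.
import torch

torch.manual_seed(0)

# Toy linear-Gaussian "policy" parameters (illustrative stand-in for a policy network).
theta = torch.zeros(4, requires_grad=True)

def sample_trajectory_loss(params):
    """Surrogate negative return on one freshly sampled 'trajectory' (toy stand-in)."""
    s = torch.randn(4)                      # random state
    a = s @ params + 0.1 * torch.randn(())  # noisy action under the current policy
    return (a - 1.0) ** 2                   # negative-reward surrogate

def grad_and_hvp(params, direction):
    """Stochastic gradient and Hessian-vector product evaluated on the same sample."""
    loss = sample_trajectory_loss(params)
    (g,) = torch.autograd.grad(loss, params, create_graph=True)
    (hvp,) = torch.autograd.grad(g @ direction, params)
    return g.detach(), hvp.detach()

beta, eta = 0.1, 0.05               # momentum weight and base step size (assumed values)
prev_theta = theta.detach().clone()
d = torch.zeros_like(theta)         # variance-reduced gradient estimate

for t in range(200):
    delta = theta.detach() - prev_theta
    g, hvp = grad_and_hvp(theta, delta)
    # Recursive momentum with a Hessian correction instead of importance weights:
    # d_t = beta * g_t + (1 - beta) * (d_{t-1} + H_t (theta_t - theta_{t-1}))
    d = beta * g + (1 - beta) * (d + hvp)
    prev_theta = theta.detach().clone()
    # Simple adaptive step size (norm-based choice, assumed for illustration only).
    step = eta / (1.0 + d.norm())
    with torch.no_grad():
        theta -= step * d
```

Because both the gradient and the Hessian-vector product in the correction term are computed from the same freshly sampled trajectory at the current parameters, the estimator tracks the drift of the policy without reweighting old samples, which is the role importance sampling plays in many earlier variance-reduced policy gradient methods.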