Policy gradient methods are widely used in reinforcement learning to search for better policies in a parameterized policy space. They perform gradient search in the policy space and are known to converge slowly. Nesterov developed an accelerated gradient search algorithm for convex optimization problems, which has recently been extended to non-convex and stochastic optimization. We use Nesterov's acceleration for the policy gradient search in the well-known actor-critic algorithm and show its convergence using the ODE method. We test the algorithm on a scheduling problem in which an incoming job is scheduled into one of four queues based on the queue lengths. Experimental results show that the algorithm using Nesterov's acceleration performs significantly better than the algorithm that does not use acceleration. To the best of our knowledge, this is the first time Nesterov's acceleration has been used with the actor-critic algorithm.
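To make the idea concrete, the following is a minimal sketch (not the paper's exact two-timescale algorithm) of a one-step actor-critic in which the actor update uses a Nesterov-style look-ahead step: the policy gradient is evaluated at a momentum-extrapolated point rather than at the current iterate. The softmax policy, linear features, and the step-size and momentum parameters (alpha, beta) are illustrative assumptions, not taken from the source.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class NesterovActorCritic:
    """Illustrative one-step actor-critic with a Nesterov-accelerated actor update."""

    def __init__(self, n_features, n_actions, alpha=0.01, beta=0.9, gamma=0.99):
        self.theta = np.zeros((n_actions, n_features))   # actor (policy) parameters
        self.theta_prev = np.zeros_like(self.theta)      # previous iterate, used for momentum
        self.w = np.zeros(n_features)                    # critic (value function) parameters
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

    def act(self, phi):
        # Sample an action from the softmax policy pi(a | s) = softmax(theta @ phi)
        probs = softmax(self.theta @ phi)
        return np.random.choice(len(probs), p=probs)

    def update(self, phi, a, r, phi_next, done):
        # Critic: one-step TD error and a TD(0) update of the linear value function
        v = self.w @ phi
        v_next = 0.0 if done else self.w @ phi_next
        delta = r + self.gamma * v_next - v
        self.w += 0.1 * delta * phi

        # Actor: Nesterov look-ahead point, then a policy-gradient ascent step taken there
        y = self.theta + self.beta * (self.theta - self.theta_prev)
        probs_y = softmax(y @ phi)
        grad_logpi = -np.outer(probs_y, phi)
        grad_logpi[a] += phi                             # gradient of log pi(a|s) at y
        self.theta_prev = self.theta.copy()
        self.theta = y + self.alpha * delta * grad_logpi
```

The only change relative to a plain actor-critic is the look-ahead point y: without the beta term the update reduces to the ordinary (unaccelerated) policy gradient step the abstract compares against.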