Reinforcement learning is a promising approach to learning robot controllers. It has recently been shown that algorithms based on finite-difference estimates of the policy gradient are competitive with algorithms based on the policy gradient theorem. We propose a theoretical framework for understanding this phenomenon. Our key insight is that many dynamical systems (especially those of interest in robotics control tasks) are nearly deterministic: they can be modeled as a deterministic system subject to a small stochastic perturbation. We show that for such systems, finite-difference estimates of the policy gradient can have substantially lower variance than estimates based on the policy gradient theorem. Finally, we evaluate our insights empirically in an experiment on an inverted pendulum.
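To make the variance comparison concrete, the following is a minimal sketch (not the paper's experiment) contrasting the two estimators on a 1D linear system with small dynamics noise, standing in for the inverted pendulum. The likelihood-ratio (REINFORCE) estimator follows the policy gradient theorem with a Gaussian policy; the finite-difference estimator perturbs the parameter of a deterministic policy. All constants (horizon, noise scales, perturbation size) are illustrative assumptions.

```python
import numpy as np

H = 20            # rollout horizon
SIGMA_DYN = 0.01  # dynamics noise scale: small => "nearly deterministic"
SIGMA_PI = 0.1    # Gaussian exploration std used by the likelihood-ratio estimator
THETA = -0.5      # parameter of the linear policy a = theta * x
DELTA = 0.05      # finite-difference perturbation size
X0 = 1.0          # initial state

def lr_estimate(rng):
    """One REINFORCE estimate of dJ/dtheta under the Gaussian policy N(theta*x, SIGMA_PI^2)."""
    x, ret, score = X0, 0.0, 0.0
    for _ in range(H):
        a = THETA * x + SIGMA_PI * rng.standard_normal()
        score += (a - THETA * x) * x / SIGMA_PI**2      # d/dtheta log pi(a|x)
        ret += -x**2                                    # quadratic cost as reward
        x = x + a + SIGMA_DYN * rng.standard_normal()   # nearly deterministic dynamics
    return score * ret                                   # plain REINFORCE, no baseline

def episode_return(theta, rng):
    """Return of one rollout under the deterministic policy a = theta * x."""
    x, ret = X0, 0.0
    for _ in range(H):
        a = theta * x
        ret += -x**2
        x = x + a + SIGMA_DYN * rng.standard_normal()
    return ret

def fd_estimate(rng):
    """One two-point central finite-difference estimate of dJ/dtheta.

    The two rollouts use independent noise draws; common random numbers
    would reduce the variance further.
    """
    return (episode_return(THETA + DELTA, rng)
            - episode_return(THETA - DELTA, rng)) / (2 * DELTA)

rng = np.random.default_rng(0)
lr = np.array([lr_estimate(rng) for _ in range(5000)])
fd = np.array([fd_estimate(rng) for _ in range(5000)])
print(f"LR estimate: mean {lr.mean():+.3f}, variance {lr.var():.3f}")
print(f"FD estimate: mean {fd.mean():+.3f}, variance {fd.var():.3f}")
```

With SIGMA_DYN this small, both estimators agree on the sign and rough magnitude of the gradient, but the finite-difference estimate's variance is orders of magnitude lower, since the only randomness it sees is the small dynamics perturbation; the likelihood-ratio estimate also pays for the policy's own exploration noise. Note the two estimators target slightly different objectives (the REINFORCE objective averages over the stochastic policy), which is one reason their means need not match exactly.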