Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that the Mean Square Bellman Error (MSBE) is an ill-conditioned loss function, in the sense that its Hessian has a large condition number. To counter the adverse effect of MSBE's poor conditioning on gradient-based methods, we propose a low-complexity, batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than residual gradient methods while having almost the same computational complexity, and it is competitive with TD on the classic problems we tested.
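For reference, a minimal sketch of the quantities the abstract refers to, using standard definitions (the state distribution $\mu$ and the Bellman-operator notation below are illustrative assumptions, not specifics taken from the paper):

$$\mathrm{MSBE}(\theta) \;=\; \mathbb{E}_{s \sim \mu}\Big[\big(V_\theta(s) - (\mathcal{T}V_\theta)(s)\big)^2\Big], \qquad (\mathcal{T}V)(s) \;=\; \mathbb{E}\big[\,r + \gamma V(s') \mid s\,\big],$$

and the ill-conditioning claim concerns the condition number of its Hessian, $\kappa\big(\nabla^2_\theta\,\mathrm{MSBE}(\theta)\big) = \lambda_{\max}/\lambda_{\min}$, being large.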