Temporal difference (TD) learning is a simple algorithm for policy evaluation in reinforcement learning. Its performance suffers from high variance and can be naturally improved with variance-reduction techniques such as the Stochastic Variance Reduced Gradient (SVRG) method. Recently, several works have sought to fuse TD learning with SVRG to obtain a policy evaluation method with a geometric rate of convergence. However, the resulting convergence rates are significantly weaker than what SVRG achieves in the setting of convex optimization. In this work, we use a recent interpretation of TD learning as the splitting of the gradient of an appropriately chosen function, which simplifies the algorithm and its fusion with SVRG. We prove a geometric convergence bound with a predetermined learning rate of 1/8, identical to the convergence bound available for SVRG in the convex setting.
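To make the setting concrete, the following is a minimal Python sketch (not the paper's exact algorithm or its analysis): TD(0) with linear value-function approximation over a fixed batch of transitions, combined with an SVRG-style recentred update and the fixed step size 1/8 mentioned above. The synthetic features, rewards, and epoch/step counts are illustrative assumptions.

```python
# Toy sketch of TD(0) policy evaluation with SVRG-style variance reduction.
# Illustrative only: the data, feature dimension, and loop lengths are
# made-up assumptions, not the setup analyzed in the paper.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic batch of N transitions (phi(s), r, phi(s')) with d features,
# rows normalized so that ||phi|| <= 1 (a standard TD assumption).
N, d, gamma = 200, 5, 0.9
Phi = rng.normal(size=(N, d))
Phi_next = rng.normal(size=(N, d))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)
Phi_next /= np.linalg.norm(Phi_next, axis=1, keepdims=True)
rewards = rng.normal(size=N)

def td_direction(theta, i):
    """Per-sample TD(0) update direction delta_i(theta) * phi_i."""
    delta = rewards[i] + gamma * Phi_next[i] @ theta - Phi[i] @ theta
    return delta * Phi[i]

def mean_td_direction(theta):
    """Full-batch average of the TD(0) update directions at theta."""
    deltas = rewards + gamma * Phi_next @ theta - Phi @ theta
    return (deltas[:, None] * Phi).mean(axis=0)

theta = np.zeros(d)
alpha = 1.0 / 8.0          # fixed learning rate, as in the stated bound
epochs, inner_steps = 20, N

for _ in range(epochs):
    theta_ref = theta.copy()                 # snapshot ("reference") iterate
    full_dir = mean_td_direction(theta_ref)  # full-batch direction at snapshot
    for _ in range(inner_steps):
        i = rng.integers(N)
        # SVRG-style recentring: stochastic direction at theta, corrected by
        # the same sample's direction at the snapshot plus the full-batch mean.
        g = td_direction(theta, i) - td_direction(theta_ref, i) + full_dir
        theta = theta + alpha * g

# Compare against the batch TD fixed point A theta = b,
# with A = (1/N) Phi^T (Phi - gamma Phi') and b = (1/N) Phi^T r.
A = Phi.T @ (Phi - gamma * Phi_next) / N
b = Phi.T @ rewards / N
theta_star = np.linalg.solve(A, b)
print("distance to TD fixed point:", np.linalg.norm(theta - theta_star))
```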