Gradient temporal difference (Gradient TD) algorithms are a popular class of stochastic approximation (SA) algorithms used for policy evaluation in reinforcement learning. Here, we consider Gradient TD algorithms with an additional heavy ball momentum term and provide a choice of step size and momentum parameter that ensures almost sure asymptotic convergence of these algorithms. In doing so, we decompose the heavy ball Gradient TD iterates into three separate iterates with different step sizes. We first analyze these iterates in the one-timescale SA setting using results from the current literature. However, the one-timescale case is restrictive, and a more general analysis can be obtained by viewing the iterates as a three-timescale decomposition. In the process, we provide the first conditions for the stability and convergence of general three-timescale SA. We then prove that the heavy ball Gradient TD algorithm is convergent using our three-timescale SA analysis. Finally, we evaluate these algorithms on standard RL problems and report an improvement in performance over the vanilla algorithms.
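For concreteness, the following is a minimal sketch of the kind of recursion studied here, assuming a TDC-style Gradient TD update with linear features $\phi_t$, TD error $\delta_t$, step sizes $\alpha_t,\zeta_t$, and momentum parameter $\eta_t$; this exact form and these symbols are illustrative assumptions rather than the paper's precise algorithm:

\begin{align*}
\delta_t &= r_{t+1} + \gamma\, \theta_t^{\top}\phi_{t+1} - \theta_t^{\top}\phi_t, \\
w_{t+1} &= w_t + \zeta_t \big( \delta_t - \phi_t^{\top} w_t \big)\, \phi_t, \\
\theta_{t+1} &= \theta_t + \alpha_t \big( \delta_t\, \phi_t - \gamma\, \phi_{t+1}\, \phi_t^{\top} w_t \big) + \eta_t \big( \theta_t - \theta_{t-1} \big).
\end{align*}

Introducing an auxiliary iterate $\nu_t = \theta_t - \theta_{t-1}$ turns the momentum term into a third coupled recursion, which is one way a decomposition into three iterates with different step sizes can arise; when the step sizes decrease at different rates, the coupled system is naturally analyzed as a three-timescale SA scheme.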