Sutton, Szepesv\'{a}ri, and Maei introduced the first gradient temporal-difference (GTD) learning algorithms that are compatible with both linear function approximation and off-policy training. The goals of this paper are (a) to propose variants of the GTD algorithms together with an extensive comparative analysis and (b) to establish new theoretical analysis frameworks for them. The proposed variants are based on convex-concave saddle-point interpretations of the GTD algorithms, which effectively unify them in a single framework and admit a simple stability analysis based on recent results on primal-dual gradient dynamics. Finally, a numerical comparative analysis is given to evaluate these approaches.
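For concreteness, a minimal sketch of such a saddle-point interpretation, with notation assumed here rather than taken from the abstract: writing $\varphi, \varphi'$ for the features of successive states, $\gamma$ for the discount factor, $r$ for the reward, and $A = \mathbb{E}[\varphi(\varphi - \gamma\varphi')^{\top}]$, $b = \mathbb{E}[r\varphi]$, $C = \mathbb{E}[\varphi\varphi^{\top}]$, the mean-squared projected Bellman error minimized by GTD-type methods can be expressed (up to a constant factor) as the convex-concave problem
\[
\min_{\theta} \; \tfrac{1}{2}\,\lVert A\theta - b \rVert_{C^{-1}}^{2}
\;=\;
\min_{\theta}\,\max_{w}\;\Big( w^{\top}(b - A\theta) \;-\; \tfrac{1}{2}\, w^{\top} C\, w \Big),
\]
where the inner maximization over $w$ is attained at $w^{*} = C^{-1}(b - A\theta)$. Stochastic primal-dual (gradient descent-ascent) updates on the right-hand side then yield GTD-style two-time-scale iterations, which is the structure the stability analysis via primal-dual gradient dynamics exploits.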