Emphatic temporal difference (ETD) learning (Sutton et al., 2016) is a successful method for off-policy value function evaluation with function approximation. Although ETD has been shown to converge asymptotically to a desirable value function, it is well known that ETD often suffers from large variance, so that its sample complexity can grow exponentially with the number of iterations. In this work, we propose a new ETD method, called PER-ETD (i.e., PEriodically Restarted-ETD), which restarts the follow-on trace and updates it for only a finite period in each iteration of the evaluation parameter. Further, PER-ETD increases the restart period logarithmically with the number of iterations, which guarantees the best trade-off between variance and bias and keeps both vanishing sublinearly. We show that PER-ETD converges to the same desirable fixed point as ETD, but improves the sample complexity of ETD from exponential to polynomial. Our experiments validate the advantage of PER-ETD over ETD.
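To make the restart mechanism concrete, the following is a minimal sketch of the idea described above, not the authors' implementation. It assumes a hypothetical two-state toy problem in which action a always leads to state a, one-hot (tabular) features, a constant interest of 1, and the trace recursion F <- gamma * rho * F + 1. The names `restart_period`, `step`, and `per_etd`, the `scale` constant, and the step size are all illustrative assumptions.

```python
import math
import random

GAMMA = 0.5
# Toy assumption: two states, two actions, and action a always leads to state a.
BEHAVIOR = [0.5, 0.5]  # behavior policy mu(a), identical in both states
TARGET = [0.9, 0.1]    # target policy pi(a), identical in both states


def restart_period(t, scale=2.0):
    """Hypothetical schedule: period grows logarithmically with iteration t >= 1."""
    return max(1, math.ceil(scale * math.log(t + 1)))


def step(state, rng):
    """Sample one behavior-policy transition; return (rho, reward, next_state)."""
    action = 0 if rng.random() < BEHAVIOR[0] else 1
    rho = TARGET[action] / BEHAVIOR[action]  # importance sampling ratio
    reward = float(action)                   # reward 1 for action 1, else 0
    return rho, reward, action               # next state equals the action


def per_etd(num_iters=5000, step_size=0.01, seed=0):
    """Sketch of periodically restarted ETD(0) with one-hot features."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # with one-hot features, theta[s] is the value estimate
    state = 0
    for t in range(1, num_iters + 1):
        trace = 1.0  # restart the follow-on trace at every iteration
        for _ in range(restart_period(t)):
            rho, _reward, state = step(state, rng)
            trace = GAMMA * rho * trace + 1.0  # finite-period trace update
        # One emphatic TD update of the evaluation parameter.
        rho, reward, next_state = step(state, rng)
        td_error = reward + GAMMA * theta[next_state] - theta[state]
        theta[state] += step_size * trace * rho * td_error
        state = next_state
    return theta


if __name__ == "__main__":
    print(per_etd())
```

In this toy problem the target policy's value is 0.1 / (1 - 0.5) = 0.2 in both states, so the returned estimates can be compared against that number. Because the trace is reset after a logarithmically growing number of steps, it remains bounded here, which is the property the restart is meant to provide.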