Off-policy learning ability is an important feature of reinforcement learning (RL) for practical applications. However, even one of the most elementary RL algorithms, temporal-difference (TD) learning, is known to suffer from divergence when the off-policy scheme is used together with linear function approximation. To overcome this divergent behavior, several off-policy TD-learning algorithms, including gradient-TD learning (GTD) and TD-learning with correction (TDC), have been developed. In this work, we provide a unified view of such algorithms from a purely control-theoretic perspective and propose a new convergent algorithm. Our method relies on the backstepping technique, which is widely used in nonlinear control theory. Finally, the convergence of the proposed algorithm is experimentally verified in environments where standard TD-learning is known to be unstable.
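To make the instability being addressed concrete, the following minimal sketch shows the standard semi-gradient off-policy TD(0) update with linear value-function approximation, i.e., the baseline that can diverge under off-policy sampling; the function name, step sizes, and toy feature vectors are illustrative assumptions and not taken from the paper.

```python
import numpy as np

def off_policy_td0_update(theta, phi_s, phi_s_next, reward, rho,
                          alpha=0.01, gamma=0.99):
    """One semi-gradient off-policy TD(0) step with linear approximation.

    theta      : weight vector; the value estimate is V(s) = theta @ phi(s)
    phi_s      : feature vector of the current state
    phi_s_next : feature vector of the next state
    rho        : importance-sampling ratio pi(a|s) / b(a|s) between the
                 target policy pi and the behavior policy b
    """
    td_error = reward + gamma * theta @ phi_s_next - theta @ phi_s
    return theta + alpha * rho * td_error * phi_s

# Tiny usage example with hypothetical two-dimensional features.
theta = np.zeros(2)
theta = off_policy_td0_update(theta,
                              phi_s=np.array([1.0, 0.0]),
                              phi_s_next=np.array([0.0, 1.0]),
                              reward=1.0,
                              rho=1.5)
print(theta)
```

Algorithms such as GTD and TDC modify this update (e.g., by maintaining an auxiliary weight vector) precisely to restore convergence when rho differs from one.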