Integral to recent successes in deep reinforcement learning has been a class of temporal difference methods that use infrequently updated target values for policy evaluation in a Markov Decision Process. Yet a complete theoretical explanation for the effectiveness of target networks remains elusive. In this work, we provide an analysis of this popular class of algorithms to finally answer the question: `why do target networks stabilise TD learning?' To do so, we formalise the notion of a partially fitted policy evaluation method, which describes the use of target networks and bridges the gap between fitted methods and semigradient temporal difference algorithms. Using this framework, we are able to uniquely characterise the so-called deadly triad - the use of TD updates with (nonlinear) function approximation and off-policy data - which often leads to nonconvergent algorithms. This insight leads us to conclude that the use of target networks can mitigate the effects of poor conditioning in the Jacobian of the TD update. Moreover, we show that under mild regularity conditions and a well-tuned target network update frequency, convergence can be guaranteed even in the extremely challenging off-policy sampling and nonlinear function approximation setting.
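For concreteness, here is a minimal sketch of the kind of update the abstract refers to, written in standard notation that is not defined above and is therefore an assumption: $V_\theta$ is the parametric value estimate, $\bar\theta$ the target network parameters, $\alpha$ the step size, $\gamma$ the discount factor, and $K$ the target update period. The semi-gradient TD(0) update with a target network then reads
$$
\theta_{t+1} = \theta_t + \alpha \bigl( r_t + \gamma V_{\bar\theta}(s_{t+1}) - V_{\theta_t}(s_t) \bigr) \nabla_\theta V_{\theta_t}(s_t),
\qquad
\bar\theta \leftarrow \theta_t \ \text{every } K \text{ steps}.
$$
Read this way, $K = 1$ recovers vanilla semi-gradient TD, while letting the inner updates settle before each refresh approaches a fully fitted policy evaluation step; intermediate $K$ is one way to interpret the partially fitted methods described above, which bridge the gap between these two extremes.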