The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, and bootstrapping simultaneously. In this paper, we investigate the target network as a tool for breaking the deadly triad, providing theoretical support for the conventional wisdom that a target network stabilizes training. We first propose and analyze a novel target network update rule which augments the commonly used Polyak-averaging style update with two projections. We then apply the target network and ridge regularization in several divergent algorithms and show their convergence to regularized TD fixed points. Those algorithms are off-policy with linear function approximation and bootstrapping, spanning both policy evaluation and control, as well as both discounted and average-reward settings. In particular, we provide the first convergent linear $Q$-learning algorithms under nonrestrictive and changing behavior policies without bi-level optimization.
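For concreteness, the shape of a Polyak-averaging style target network update augmented with a projection can be sketched as below; the symbols $\theta_t$ (online weights), $\theta_t^-$ (target weights), $\tau$ (averaging rate), and $\Gamma$ (projection onto a bounded convex set) are illustrative notation only, and the paper's actual rule uses two projections rather than the single generic one shown here:
$$\theta_{t+1}^- \;=\; \Gamma\!\big((1-\tau)\,\theta_t^- + \tau\,\theta_t\big), \qquad \tau \in (0, 1].$$
Keeping the target weights inside a bounded set via $\Gamma$, while they slowly track the online weights, is the mechanism through which the analysis obtains convergence guarantees.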