Off-policy learning allows us to learn about possible policies of behaviour from experience generated by a different behaviour policy. Temporal difference (TD) learning algorithms can become unstable when combined with function approximation and off-policy sampling; this combination is known as the ``deadly triad''. The emphatic temporal difference algorithm, ETD($\lambda$), ensures convergence in the linear case by appropriately re-weighting the TD($\lambda$) updates. In this paper, we extend the use of emphatic methods to deep reinforcement learning agents. We show that naively adapting ETD($\lambda$) to popular deep reinforcement learning algorithms, which use forward-view multi-step returns, results in poor performance. We then derive new emphatic algorithms suited to this setting, and we demonstrate that they provide noticeable benefits on small problems designed to highlight the instability of TD methods. Finally, we observe improved performance when applying these algorithms at scale on classic Atari games from the Arcade Learning Environment.
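For reference, a sketch of the standard linear-case ETD($\lambda$) update alluded to above (following Sutton, Mahmood \& White, 2016); the notation here is illustrative and may differ from the main text, with $i(s)$ an interest function, $F_t$ the followon trace, and $M_t$ the resulting emphasis:
\begin{align*}
\delta_t &= R_{t+1} + \gamma\, v(S_{t+1}; \theta_t) - v(S_t; \theta_t), \\
F_t &= \rho_{t-1}\, \gamma\, F_{t-1} + i(S_t), \qquad M_t = \lambda\, i(S_t) + (1 - \lambda)\, F_t, \\
e_t &= \rho_t \big( \gamma \lambda\, e_{t-1} + M_t\, \nabla_{\theta} v(S_t; \theta_t) \big), \\
\theta_{t+1} &= \theta_t + \alpha\, \delta_t\, e_t,
\end{align*}
where $\rho_t = \pi(A_t \mid S_t) / \mu(A_t \mid S_t)$ is the importance-sampling ratio correcting from the behaviour policy $\mu$ to the target policy $\pi$. The emphasis $M_t$ is what distinguishes ETD($\lambda$) from ordinary off-policy TD($\lambda$) and underlies its convergence guarantee with linear function approximation.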