Off-policy sampling and experience replay are key to improving sample efficiency and scaling model-free temporal difference learning methods. When these are combined with function approximation, such as neural networks, the result is known as the deadly triad and is potentially unstable. Recently, it has been shown that stability and good performance at scale can be achieved by combining emphatic weightings and multi-step updates. This approach, however, is generally limited to sampling complete trajectories in order, so that the required emphatic weighting can be computed. In this paper we investigate how to combine emphatic weightings with non-sequential, offline data sampled from a replay buffer. We develop a multi-step emphatic weighting that can be combined with replay, and a time-reversed $n$-step TD learning algorithm to learn the required emphatic weighting. We show that these state weightings reduce variance compared with prior approaches, while providing convergence guarantees. We tested the approach at scale on Atari 2600 video games and observed that the new X-ETD($n$) agent improved over baseline agents, highlighting both the scalability and broad applicability of our approach.