Eligibility traces are an effective technique to accelerate reinforcement learning by smoothly assigning credit to recently visited states. However, their online implementation is incompatible with modern deep reinforcement learning algorithms, which rely heavily on i.i.d. training data and offline learning. We utilize an efficient, recursive method for computing {\lambda}-returns offline that can provide the benefits of eligibility traces to any value-estimation or actor-critic method. We demonstrate how our method can be combined with DQN, DRQN, and A3C to greatly enhance the learning speed of these algorithms when playing Atari 2600 games, even under partial observability. Our results indicate several-fold improvements to sample efficiency on Seaquest and Q*bert. We expect similar results for other algorithms and domains not considered here, including those with continuous actions.
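The abstract refers to a recursive, offline computation of {\lambda}-returns. As a rough illustration only (not the paper's implementation), the sketch below applies the standard backward recursion $G_t^{\lambda} = r_t + \gamma\big[(1-\lambda)V(s_{t+1}) + \lambda G_{t+1}^{\lambda}\big]$ over a stored trajectory; the function name, argument layout, and hyperparameter values are assumptions chosen for clarity.

```python
import numpy as np

def lambda_returns(rewards, next_values, gamma=0.99, lam=0.8):
    """Minimal sketch: compute lambda-returns for a stored trajectory
    in a single backward pass (O(T) time, no eligibility-trace vectors).

    Assumed inputs (illustrative, not the paper's API):
      rewards[t]     = r_t              for t = 0..T-1
      next_values[t] = V(s_{t+1})       for t = 0..T-1
    The final entry next_values[T-1] = V(s_T) bootstraps the tail.
    """
    T = len(rewards)
    returns = np.empty(T)
    g = next_values[-1]  # G_T is bootstrapped from V(s_T)
    for t in reversed(range(T)):
        # Standard recursion: G_t = r_t + gamma * ((1-lam) * V(s_{t+1}) + lam * G_{t+1})
        g = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g)
        returns[t] = g
    return returns

# Example usage on a toy 3-step trajectory (values are made up):
rewards = np.array([1.0, 0.0, 2.0])
next_values = np.array([0.5, 1.5, 0.0])  # V(s_1), V(s_2), V(s_3)
print(lambda_returns(rewards, next_values))
```

Because the recursion runs backward over a completed segment, the resulting targets can be cached and replayed like ordinary n-step returns, which is what makes the approach compatible with offline, minibatch-based learners such as DQN.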