Despite the empirical success of the deep Q network (DQN) reinforcement learning algorithm and its variants, DQN is still not well understood and does not guarantee convergence. In this work, we show that DQN can indeed diverge and cease to operate in realistic settings. Although gradient-based convergent methods exist, we show that their learning dynamics have inherent problems that cause them to fail even on simple tasks. To overcome these problems, we propose a convergent DQN algorithm (C-DQN) that is guaranteed to converge and can work with large discount factors (e.g., 0.9998). It learns robustly in difficult settings and can solve several difficult games in the Atari 2600 benchmark that DQN fails to learn. Our code has been publicly released and can be used to reproduce our results.