Many of the recent successes in Deep Reinforcement Learning have been based on minimizing the squared Bellman error. However, training is often unstable because the target Q-values change rapidly, so target networks, an additional set of lagging parameters, are employed to regularize the Q-value estimates and stabilize training. Despite their advantages, target networks are a potentially inflexible way to regularize Q-values and may ultimately slow down training. In this work, we address this issue by augmenting the squared Bellman error with a functional regularizer. Unlike target networks, the regularization we propose is explicit, which lets us use up-to-date parameters and control the regularization strength. This leads to a faster yet more stable training method. We analyze the convergence of our method theoretically and empirically validate our predictions on simple environments as well as on a suite of Atari environments, demonstrating improvements over target-network-based methods in terms of both sample efficiency and performance. In summary, our approach provides a fast and stable alternative to the standard squared-Bellman-error objective.
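To make the contrast with target networks concrete, here is a minimal sketch of a functionally regularized TD loss of the kind the abstract describes: the bootstrapped Bellman target is computed with the up-to-date online network, and an explicit penalty keeps the Q-values close to a reference network in function space. The names fr_td_loss, q_reg, and kappa are illustrative assumptions, not the paper's exact objective or notation.

```python
import torch
import torch.nn.functional as F

def fr_td_loss(q_net, q_reg, batch, gamma=0.99, kappa=0.1):
    """Illustrative sketch (not the paper's exact loss): squared Bellman
    error with an up-to-date bootstrap target, plus an explicit functional
    regularizer toward a reference network q_reg, weighted by kappa."""
    s, a, r, s_next, done = batch  # tensors: states, actions, rewards, next states, done flags
    # Q(s, a) for the actions actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target uses the current (online) parameters,
        # unlike target-network methods that use lagging parameters here.
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
        # Reference Q-values for the explicit functional regularizer.
        q_ref = q_reg(s).gather(1, a.unsqueeze(1)).squeeze(1)
    bellman = F.mse_loss(q_sa, target)
    func_reg = F.mse_loss(q_sa, q_ref)
    return bellman + kappa * func_reg
```

In this sketch, setting kappa to zero recovers an unregularized semi-gradient update with the online network, while larger kappa enforces stronger, explicitly controllable regularization toward the reference Q-function.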