Target networks are at the core of recent successes in Reinforcement Learning. They stabilize training by using old parameters to estimate the target $Q$-values, but this also limits the propagation of newly encountered rewards, which can ultimately slow down training. In this work, we propose an alternative training method based on functional regularization that does not have this deficiency. Unlike target networks, our method uses up-to-date parameters to estimate the target $Q$-values, thereby speeding up training while maintaining stability. Surprisingly, in some cases, we can show that target networks are a special, restricted type of functional regularizer. Using this approach, we show empirical improvements in sample efficiency and performance across a range of Atari and simulated robotics environments.
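The following is a minimal sketch (not the paper's implementation) of the distinction drawn above, assuming a DQN-style setup: the target-network loss bootstraps from a lagging copy of the $Q$-network, while the functional-regularization loss bootstraps from the current parameters and adds a penalty on the $Q$-function's outputs toward the lagging network. The network shapes, the penalty weight `kappa`, and the sampled batch are illustrative assumptions.

```python
# Sketch contrasting (a) a target-network TD loss with (b) a
# functional-regularization (FR) loss; hyperparameters are illustrative.
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
lag_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
lag_net.load_state_dict(q_net.state_dict())  # lagging copy, refreshed periodically

gamma, kappa = 0.99, 0.1
# Dummy transition batch (state, action, reward, next state).
s = torch.randn(32, 4)
a = torch.randint(0, 2, (32, 1))
r = torch.randn(32, 1)
s_next = torch.randn(32, 4)

q_sa = q_net(s).gather(1, a)

# (a) Target-network loss: the bootstrap target uses the *old* parameters.
with torch.no_grad():
    tn_target = r + gamma * lag_net(s_next).max(dim=1, keepdim=True).values
loss_target_net = ((q_sa - tn_target) ** 2).mean()

# (b) Functional-regularization loss: the bootstrap target uses the *current*
# parameters; stability comes instead from penalizing the deviation of the
# Q-function's outputs from the lagging network's outputs.
with torch.no_grad():
    fr_target = r + gamma * q_net(s_next).max(dim=1, keepdim=True).values
loss_fr = ((q_sa - fr_target) ** 2).mean() \
    + kappa * ((q_net(s) - lag_net(s)) ** 2).mean()
```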