Despite overparameterization, deep networks trained via supervised learning are easy to optimize and exhibit excellent generalization. One hypothesis to explain this is that overparameterized deep networks enjoy the benefits of implicit regularization induced by stochastic gradient descent, which favors parsimonious solutions that generalize well on test inputs. It is reasonable to surmise that deep reinforcement learning (RL) methods could also benefit from this effect. In this paper, we discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in the offline deep RL setting, leading to poor generalization and degenerate feature representations. Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions with excessive "aliasing", in stark contrast to the supervised learning case. We back up these findings empirically, showing that feature representations learned by a deep network value function trained via bootstrapping can indeed become degenerate, aliasing the representations for state-action pairs that appear on either side of the Bellman backup. To address this issue, we derive the form of this implicit regularizer and, inspired by this derivation, propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer. When combined with existing offline RL methods, DR3 substantially improves performance and stability, alleviating unlearning in Atari 2600 games, D4RL domains and robotic manipulation from images.
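To make the proposed regularizer concrete, below is a minimal sketch of how the DR3 penalty could be added to a standard TD loss. It assumes a PyTorch Q-network that exposes its penultimate-layer features via a hypothetical `features()` method; the batch fields (`obs`, `act`, `rew`, `next_obs`, `next_act`) and the coefficient `c0` are illustrative, not the paper's exact implementation.

```python
# Sketch of a TD loss augmented with the DR3 feature dot-product
# regularizer. Assumes q_net and target_net map (obs, act) -> Q-value,
# and that q_net.features(obs, act) (a hypothetical method) returns the
# penultimate-layer features phi(s, a).
import torch
import torch.nn.functional as F

def dr3_td_loss(q_net, target_net, batch, gamma=0.99, c0=0.001):
    # Standard TD target; no gradient flows through the target network.
    # batch.next_act is the action used in the Bellman backup, e.g. the
    # policy's action in actor-critic methods or argmax_a Q(s', a).
    with torch.no_grad():
        target = batch.rew + gamma * target_net(batch.next_obs, batch.next_act)
    td_error = F.mse_loss(q_net(batch.obs, batch.act), target)

    # DR3 penalty: discourage aliasing by penalizing the dot product
    # between features of state-action pairs on either side of the
    # backup, phi(s, a) . phi(s', a').
    phi = q_net.features(batch.obs, batch.act)
    phi_next = q_net.features(batch.next_obs, batch.next_act)
    dr3_penalty = (phi * phi_next).sum(dim=-1).mean()

    # c0 is a tunable regularization coefficient (value here is
    # illustrative only).
    return td_error + c0 * dr3_penalty
```

Because the penalty is just an extra additive term on the critic loss, it can be layered onto existing offline RL objectives (e.g., a conservative Q-learning loss) without altering the rest of the training loop.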