Gradient descent and its variants are widely used to train neural networks. However, in deep Q-learning with neural network approximation, a form of reinforcement learning, gradient descent (known in this setting as Residual Gradient (RG)) is rarely used to solve the Bellman residual minimization problem. Instead, Temporal Difference (TD), an incomplete gradient descent method, prevails. In this work, we perform extensive experiments showing that TD outperforms RG: when training drives the Bellman residual to a small value, the solution found by TD yields a better policy and is more robust to perturbations of the neural network parameters. Our experiments further reveal a key difference between reinforcement learning and supervised learning: in reinforcement learning, a small Bellman residual can correspond to a bad policy, whereas in supervised learning the test loss is a standard indicator of performance. We also examine empirically that the term missing from TD's update is a key reason why RG performs badly. Our work shows that the performance of a deep Q-learning solution is closely tied to its training dynamics, and how an incomplete gradient descent method manages to find a good policy is an interesting question for future study.
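For concreteness, the two updates can be written as follows; this is a sketch in standard deep Q-learning notation and may differ in details from the exact formulation used in this work. The Bellman residual loss is

\[
L(\theta) = \mathbb{E}_{(s,a,r,s')}\Big[\big(\underbrace{r + \gamma \max_{a'} Q_\theta(s',a') - Q_\theta(s,a)}_{\delta}\big)^2\Big].
\]

RG performs full gradient descent on $L(\theta)$,

\[
\theta \leftarrow \theta - \eta\, \mathbb{E}\Big[\delta\big(\gamma \nabla_\theta \max_{a'} Q_\theta(s',a') - \nabla_\theta Q_\theta(s,a)\big)\Big],
\]

while TD treats the target $r + \gamma \max_{a'} Q_\theta(s',a')$ as a constant and updates

\[
\theta \leftarrow \theta + \eta\, \mathbb{E}\big[\delta\, \nabla_\theta Q_\theta(s,a)\big],
\]

so the term $\gamma\, \delta\, \nabla_\theta \max_{a'} Q_\theta(s',a')$ is the "missing term" that distinguishes the incomplete gradient of TD from the complete gradient of RG.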