Recent developments in Deep Reinforcement Learning have demonstrated the superior performance of neural networks in solving challenging problems with large or even continuous state spaces. One specific approach is to deploy neural networks to approximate value functions by minimising the Mean Squared Bellman Error. Despite the great successes of Deep Reinforcement Learning, the development of reliable and efficient numerical algorithms to minimise the Bellman Error remains of great scientific interest and practical demand. This challenge is partially due to the underlying optimisation problem being highly non-convex, or to the use of incorrect gradient information as is done in Semi-Gradient algorithms. In this work, we analyse the Mean Squared Bellman Error from a smooth optimisation perspective combined with a Residual Gradient formulation. Our contribution is two-fold. First, we analyse critical points of the error function and provide technical insights on the optimisation procedure and design choices for neural networks. Under the assumption that global minima exist and that the objective fulfils certain conditions, we can rule out suboptimal local minima when using over-parametrised neural networks. Based on this analysis, we construct an efficient Approximate Newton's algorithm and numerically confirm its theoretical properties, such as local quadratic convergence to a global minimum. Second, we demonstrate the feasibility and generalisation capabilities of the proposed algorithm empirically on continuous control problems and provide a numerical verification of our critical point analysis. We also outline the shortcomings of Semi-Gradients: to benefit from an approximate Newton's algorithm, the complete derivatives of the Mean Squared Bellman Error must be considered during training.
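To make the distinction between the two gradient estimates concrete, the following is a minimal sketch in JAX (not the authors' implementation) contrasting the full Residual Gradient of the Mean Squared Bellman Error with the Semi-Gradient, which drops the derivative through the bootstrapped target. The network `value_net`, the parameter layout, and the synthetic transition data are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's code): Residual Gradient of
# the MSBE versus the Semi-Gradient, for a tiny value network on toy data.
import jax
import jax.numpy as jnp

def value_net(params, s):
    """A small MLP value function V_theta(s)."""
    w1, b1, w2, b2 = params
    h = jnp.tanh(s @ w1 + b1)
    return (h @ w2 + b2).squeeze(-1)

def msbe(params, s, r, s_next, gamma=0.99):
    """Mean Squared Bellman Error: E[(r + gamma * V(s') - V(s))^2].
    Differentiating this gives the full Residual Gradient."""
    delta = r + gamma * value_net(params, s_next) - value_net(params, s)
    return jnp.mean(delta ** 2)

def msbe_semi(params, s, r, s_next, gamma=0.99):
    """Semi-gradient variant: the bootstrapped target is treated as a
    constant via stop_gradient, so the derivative through V(s') is dropped."""
    target = jax.lax.stop_gradient(r + gamma * value_net(params, s_next))
    return jnp.mean((target - value_net(params, s)) ** 2)

# Illustrative parameters and a batch of synthetic transitions (s, r, s').
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
params = (jax.random.normal(k1, (3, 16)) * 0.1, jnp.zeros(16),
          jax.random.normal(k2, (16, 1)) * 0.1, jnp.zeros(1))
s, s_next = jax.random.normal(k3, (32, 3)), jax.random.normal(k4, (32, 3))
r = jnp.ones(32)

g_full = jax.grad(msbe)(params, s, r, s_next)       # Residual Gradient
g_semi = jax.grad(msbe_semi)(params, s, r, s_next)  # Semi-Gradient
# The two generally differ: the semi-gradient omits the gamma * dV(s')/dtheta
# term and is therefore not the exact gradient of the MSBE.
```

This illustrates the point made above: a Newton-type method presupposes exact first-order (and approximate second-order) information about the objective, so only the full derivative of the Mean Squared Bellman Error, as in `msbe` here, is a suitable basis for such an algorithm.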