In this paper, we study the dynamics of temporal-difference learning with neural network-based value function approximation over a general state space, namely, \emph{Neural TD learning}. We consider two practically used algorithms, projection-free and max-norm regularized Neural TD learning, and establish the first convergence bounds for them. An interesting observation from our results is that max-norm regularization can dramatically improve the performance of TD learning algorithms, in terms of both sample complexity and overparameterization. In particular, our bounds indicate that max-norm regularization is more effective than $\ell_2$-regularization on both counts. The results in this work rely on a novel Lyapunov drift analysis of the network parameters as a stopped and controlled random process.