We study the dynamics of temporal-difference learning with neural-network-based value function approximation over a general state space, namely, \emph{Neural TD learning}. Existing analyses of neural TD learning rely either on an infinite-width analysis or on constraining the network parameters to a (random) compact set; as a result, an extra projection step is required at each iteration. This paper establishes a new convergence analysis of neural TD learning \emph{without any projection}. We show that projection-free TD learning equipped with a two-layer ReLU network of any width exceeding $\mathrm{poly}(\overline{\nu},1/\epsilon)$ converges to the true value function with error $\epsilon$ given $\mathrm{poly}(\overline{\nu},1/\epsilon)$ iterations or samples, where $\overline{\nu}$ is an upper bound on the RKHS norm of the value function induced by the neural tangent kernel. Our sample-complexity and overparameterization bounds are based on a drift analysis of the network parameters as a stopped random process in the lazy training regime.
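The projection-free algorithm analyzed here is, in essence, the standard TD(0) semi-gradient update applied directly to the network parameters. As a sketch (the notation is ours, not the paper's: $f(s;\theta)$ denotes the two-layer ReLU network, $\alpha$ the step size, $\gamma$ the discount factor, and $(s_t, r_t, s_{t+1})$ the sampled transition):
\[
\delta_t = r_t + \gamma f(s_{t+1};\theta_t) - f(s_t;\theta_t),
\qquad
\theta_{t+1} = \theta_t + \alpha\,\delta_t\,\nabla_\theta f(s_t;\theta_t).
\]
The point of the analysis is that no projection of $\theta_{t+1}$ onto a compact set follows this step; the drift argument controls $\|\theta_t - \theta_0\|$ directly in the lazy training regime.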