We discuss the approximation of the value function for infinite-horizon discounted Markov Reward Processes (MRPs) with nonlinear functions trained with the Temporal-Difference (TD) learning algorithm. We first consider this problem under a certain scaling of the approximating function, leading to a regime called lazy training. In this regime, the parameters of the model vary only slightly during the learning process, a feature that has recently been observed in the training of neural networks, where the scaling we study arises naturally, implicit in the initialization of their parameters. In both the under- and over-parametrized frameworks, we prove exponential convergence of the algorithm in the lazy training regime to local and global minimizers, respectively. We then compare this scaling of the parameters to the mean-field regime, where the approximately linear behavior of the model is lost. Under this alternative scaling we prove that all fixed points of the dynamics in parameter space are global minimizers. Finally, we illustrate our convergence results with examples of models that diverge when trained with non-lazy TD learning, and with neural networks.
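As a concrete point of reference for the setting above, the following is a minimal sketch of semi-gradient TD(0) with a nonlinear approximator under a lazy-training-style scaling: the model output is centered at its initialization and multiplied by a large factor, and the step size is shrunk accordingly, so the parameters move only slightly while the predicted values still change appreciably. The toy MRP, the two-layer tanh network, and the specific choices (`alpha_scale`, the centering trick, the 1/alpha_scale^2 step size) are illustrative assumptions, not the construction or scaling analyzed in the paper.

```python
# Sketch: semi-gradient TD(0) with a lazy-scaled nonlinear value model.
# All model and MRP choices below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Toy MRP: n_states states, random transition matrix P, rewards r, discount gamma.
n_states, gamma = 5, 0.9
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n_states)

# Nonlinear approximator: one hidden layer acting on one-hot state encodings.
d_hidden = 32
W1 = rng.normal(size=(d_hidden, n_states)) / np.sqrt(n_states)
w2 = rng.normal(size=d_hidden) / np.sqrt(d_hidden)
W1_init, w2_init = W1.copy(), w2.copy()   # frozen copy used for centering

alpha_scale = 100.0              # large scale factor inducing lazy training (assumed)
lr = 0.5 / alpha_scale ** 2      # step size scaled down with alpha_scale (assumed)

def f(s, W1, w2):
    """Raw network output f_theta(s) on the one-hot encoding of state s."""
    h = np.tanh(W1[:, s])        # W1 @ one_hot(s) is just column s of W1
    return w2 @ h

def value(s, W1, w2):
    """Lazy-scaled value model: alpha * (f_theta(s) - f_theta0(s))."""
    return alpha_scale * (f(s, W1, w2) - f(s, W1_init, w2_init))

def grads(s, W1, w2):
    """Gradients of the scaled value model with respect to (W1, w2)."""
    h = np.tanh(W1[:, s])
    g_w2 = alpha_scale * h
    g_W1 = np.zeros_like(W1)
    g_W1[:, s] = alpha_scale * w2 * (1.0 - h ** 2)
    return g_W1, g_w2

s = 0
for t in range(20000):
    s_next = rng.choice(n_states, p=P[s])
    # Semi-gradient TD(0): the bootstrap target r + gamma * V(s') is treated as fixed.
    delta = r[s] + gamma * value(s_next, W1, w2) - value(s, W1, w2)
    g_W1, g_w2 = grads(s, W1, w2)
    W1 += lr * delta * g_W1
    w2 += lr * delta * g_w2
    s = s_next

# In the lazy regime the parameters barely drift from their initialization.
print("parameter drift:", np.linalg.norm(W1 - W1_init))
```

The centering at initialization and the step size proportional to 1/alpha_scale^2 are one standard way, borrowed from the lazy-training literature, to make the model behave approximately linearly in its parameters; the paper's precise scaling and assumptions may differ.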