We consider the problem of policy evaluation for continuous-time processes using the temporal-difference learning algorithm. More precisely, starting from the time discretization of a stochastic differential equation, we intend to learn the continuous-time value function using TD(0). First, we show that the standard TD(0) algorithm is doomed to fail as the time step tends to zero, because of the stochastic part of the dynamics. We then propose an additive zero-mean correction to the temporal difference that makes it robust to vanishing time steps. We propose two algorithms: the first is model-based, since it requires knowledge of the drift function of the dynamics; the second is model-free. We prove the convergence of the model-based algorithm to the continuous-time solution under a linear-parametrization assumption in two different regimes: one with a convex regularization of the problem, and the other using the Polyak-Juditsky averaging method with a constant step size and without regularization. The convergence rate obtained in the latter regime is comparable with the state of the art for the simpler problem of linear regression using stochastic gradient descent methods. From a totally different perspective, our method may be applied to solve second-order elliptic equations in non-divergence form using machine learning.
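To make the failure mode concrete, here is a minimal sketch under assumed notation (a one-dimensional diffusion $dX_t = b(X_t)\,dt + \sigma(X_t)\,dB_t$, reward $r$, discount rate $\rho$, time step $h$, and parametrized value function $V_\theta$; these symbols are not fixed by the abstract itself). The standard temporal difference on the time-discretized trajectory reads
$$
\delta_k \;=\; r(X_{kh})\,h \;+\; e^{-\rho h}\,V_\theta(X_{(k+1)h}) \;-\; V_\theta(X_{kh}).
$$
Since the increment $X_{(k+1)h}-X_{kh}$ contains a stochastic component of order $\sqrt{h}$, the temporal difference normalized by $h$ (as required to approximate the continuous-time Bellman residual) carries a zero-mean noise term of order $1/\sqrt{h}$, whose variance blows up as $h \to 0$; the additive zero-mean correction proposed in the paper can be understood as compensating for this diverging term.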