We propose a unified framework to study policy evaluation (PE) and the associated temporal difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean--square TD error approximates the quadratic variation of the martingale and thus is not a suitable objective for PE. We present two methods to use the martingale characterization for designing PE algorithms. The first one minimizes a "martingale loss function", whose solution is proved to be the best approximation of the true value function in the mean--square sense. This method interprets the classical gradient Monte-Carlo algorithm. The second method is based on a system of equations called the "martingale orthogonality conditions" with test functions. Solving these equations in different ways recovers various classical TD algorithms, such as TD($\lambda$), LSTD, and GTD. Different choices of test functions determine in what sense the resulting solutions approximate the true value function. Moreover, we prove that any convergent time-discretized algorithm converges to its continuous-time counterpart as the mesh size goes to zero, and we provide the convergence rate. We demonstrate the theoretical results and corresponding algorithms with numerical experiments and applications.
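To fix ideas, the two approaches can be sketched schematically as follows; the notation below ($J^\theta$ for a parametrized value approximation, $X$ for the state process, $r$ for the running reward, $\xi$ for a test process, and the undiscounted finite-horizon form) is illustrative shorthand rather than the formal setup developed in the body of the paper.

% Schematic notation only: J^\theta, X, r, \xi, and the horizon T are illustrative placeholders.
\[
M^\theta_t \;=\; J^\theta(t, X_t) + \int_0^t r_s \,\mathrm{d}s,
\qquad \text{PE} \;\Longleftrightarrow\; M^\theta \text{ is a martingale},
\]
\[
\text{martingale loss:}\quad \mathrm{ML}(\theta) \;=\; \frac{1}{2}\,\mathbb{E}\int_0^T \bigl|M^\theta_T - M^\theta_t\bigr|^2 \,\mathrm{d}t,
\qquad
\text{orthogonality conditions:}\quad \mathbb{E}\int_0^T \xi_t \,\mathrm{d}M^\theta_t \;=\; 0 .
\]

In this shorthand, minimizing the martingale loss corresponds to the first (gradient Monte Carlo) method, while enforcing the orthogonality conditions for different families of test processes $\xi$ corresponds to the second, from which the TD-type algorithms are recovered.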