Simple and optimal methods for stochastic variational inequalities, II: Markovian noise and policy evaluation in reinforcement learning

This paper focuses on stochastic variational inequalities (VIs) under Markovian noise. A prominent application of our algorithmic developments is the stochastic policy evaluation problem in reinforcement learning. Prior work focused on temporal difference (TD) learning and relied on nonsmooth finite-time analysis motivated by stochastic subgradient descent, which leads to certain limitations. These include the need to analyze a modified TD algorithm that involves projection onto an a priori defined Euclidean ball, a non-optimal convergence rate, and no clear way of deriving the beneficial effects of parallel implementation. Our approach remedies these shortcomings in the broader context of stochastic VIs, and in particular for stochastic policy evaluation. We develop a variety of simple TD-learning-type algorithms that are motivated by the original TD algorithm and maintain its simplicity, while offering distinct advantages from a non-asymptotic analysis point of view. We first provide an improved analysis of the standard TD algorithm that can benefit from parallel implementation. We then present versions of a conditional TD algorithm (CTD) that involve periodic updates of the stochastic iterates; these reduce the bias and therefore exhibit improved iteration complexity. This brings us to the fast TD (FTD) algorithm, which combines elements of CTD and the stochastic operator extrapolation method of the companion paper. With a novel index resetting policy, FTD exhibits the best known convergence rate. We also devise a robust version of the algorithm that is particularly suitable for discount factors close to 1.
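To make the setting concrete, the following is a minimal sketch, not the paper's algorithms. Policy evaluation with linear features can be read as finding a root of the operator F(θ) = Φᵀ D (Φθ − r − γ P Φθ), and TD(0) replaces F with an estimate built from a single observed transition of the Markov chain. The toy chain, the step-size rule c/(k0 + k), and the `skip` parameter are all assumptions introduced here for illustration. In particular, `skip > 1` (updating only on every `skip`-th transition) is just one simplified reading of "periodic updates" and should not be taken as the CTD or FTD method itself.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def td_evaluate(P, r, phi, gamma, n_updates, skip=1, c=10.0, k0=20.0, seed=0):
    """TD(0) with linear features on a finite Markov chain (illustrative sketch).

    P[s][t]: transition probability under the fixed policy (toy input).
    r[s]: reward in state s; phi[s]: feature vector of state s.
    skip: chain transitions per parameter update. skip=1 is plain TD(0);
    skip>1 is an assumed stand-in for the periodic-update idea.
    """
    rng = random.Random(seed)
    states = list(range(len(P)))
    theta = [0.0] * len(phi[0])
    s = 0
    for k in range(n_updates):
        # Let the chain run; only the last transition of each block is used.
        for _ in range(skip):
            s_prev = s
            s = rng.choices(states, weights=P[s_prev])[0]
        # Single-transition estimate of the operator at theta:
        # phi(s) * (phi(s)'theta - r(s) - gamma * phi(s')'theta)
        td_error = dot(phi[s_prev], theta) - r[s_prev] - gamma * dot(phi[s], theta)
        alpha = c / (k0 + k)  # diminishing step size (illustrative choice)
        theta = [th - alpha * td_error * f for th, f in zip(theta, phi[s_prev])]
    return theta


def exact_values(P, r, gamma, sweeps=200):
    """Reference values from iterating the Bellman equation V = r + gamma * P V."""
    V = [0.0] * len(r)
    for _ in range(sweeps):
        V = [r[s] + gamma * dot(P[s], V) for s in range(len(r))]
    return V


# Hypothetical 3-state chain with one-hot features, so theta[s] estimates V(s).
P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
r = [1.0, 0.0, -1.0]
phi = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gamma = 0.5

v_star = exact_values(P, r, gamma)
theta_td = td_evaluate(P, r, phi, gamma, n_updates=20000, skip=1)
theta_skip = td_evaluate(P, r, phi, gamma, n_updates=20000, skip=5)
```

Comparing `theta_td` and `theta_skip` against `v_star` shows both variants approaching the true values on this toy chain. Spacing out the samples makes consecutive updates less correlated, at the cost of consuming `skip` times as many transitions per update.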
