Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement Learning

The stochastic approximation algorithm is a widely used probabilistic method for finding a zero of a vector-valued function when only noisy measurements of the function are available. In the literature to date, one can make a distinction between "synchronous" updating, whereby every component of the current guess $\theta_t$ is updated at each time, and "asynchronous" updating, whereby only one component is updated. In principle, it is also possible to update, at each time instant, some but not all components of $\theta_t$, which might be termed "batch asynchronous stochastic approximation" (BASA). One can also make a distinction between using a "local" clock and using a "global" clock. In this paper, we propose a unified formulation of BASA algorithms, and develop a general methodology for proving that such algorithms converge, irrespective of whether global or local clocks are used. These convergence proofs make use of weaker hypotheses than existing results. For example, existing convergence proofs when a local clock is used require that the measurement noise be an i.i.d. sequence. Here, it is assumed only that the measurement errors form a martingale difference sequence. Also, all results to date assume that the stochastic step sizes satisfy a probabilistic analog of the Robbins-Monro conditions. We replace this by a purely deterministic condition on the irreducibility of the underlying Markov processes. As specific applications to Reinforcement Learning, we introduce "batch" versions of the temporal difference algorithm $TD(0)$ for value iteration, and of the $Q$-learning algorithm for finding the optimal action-value function, and also permit the use of local clocks instead of a global clock. In all cases, we establish the convergence of these algorithms under milder conditions than in the existing literature.
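
To make the distinction between batch updating and the two clock conventions concrete, the following is a minimal sketch and not the algorithm analyzed in the paper. Everything in it is an assumption chosen for illustration: the function names `basa` and `toy_f`, the toy map $f(\theta) = b - \theta$, the independent coin flip that decides which components are updated at each time, the i.i.d. Gaussian noise (a special case of a martingale difference sequence), and the step size $1/n$. With a global clock, $n$ is the time index; with a local clock, $n$ counts how often that particular component has been updated.

```python
import random


def toy_f(theta):
    """Hypothetical test function f(theta) = b - theta, whose zero is theta = b."""
    target = [1.0, -2.0, 3.0]
    return [b - x for b, x in zip(target, theta)]


def basa(f, theta0, steps, clock="local", p_update=0.5, noise_std=0.1, seed=0):
    """Illustrative batch asynchronous stochastic approximation.

    At each time t a random subset (the "batch") of components is updated
    using a noisy measurement of f. The Bernoulli selection rule and the
    1/n step size are assumptions made for this sketch only.
    """
    rng = random.Random(seed)
    theta = list(theta0)
    d = len(theta)
    counts = [0] * d  # local clocks: number of updates of each component

    for t in range(steps):
        # Noisy measurement of f at the current guess.
        noisy = [v + rng.gauss(0.0, noise_std) for v in f(theta)]
        # Select some, but not necessarily all, components to update.
        batch = [i for i in range(d) if rng.random() < p_update]
        for i in batch:
            counts[i] += 1
            # Local clock: step size depends on this component's own count.
            # Global clock: step size depends on the shared time index.
            n = counts[i] if clock == "local" else t + 1
            theta[i] += noisy[i] / n
    return theta
```

Calling `basa(toy_f, [0.0, 0.0, 0.0], steps=20000, clock="local")` and then the same with `clock="global"` lets the two conventions be compared on the same toy problem. In this particular example the local-clock iterate for each component is simply the running average of its noisy targets.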
