The stochastic approximation algorithm is a widely used probabilistic method for finding a zero of a vector-valued function when only noisy measurements of the function are available. In the literature to date, one can make a distinction between "synchronous" updating, whereby every component of the current guess $\theta_t$ is updated at each time, and "asynchronous" updating, whereby only one component is updated. In principle, it is also possible to update, at each time instant, some but not all components of $\theta_t$, which might be termed "batch asynchronous stochastic approximation" (BASA). One can also make a distinction between using a "local" clock and a "global" clock. In this paper, we propose a unified formulation of BASA algorithms, and develop a general methodology for proving that such algorithms converge, irrespective of whether global or local clocks are used. These convergence proofs make use of weaker hypotheses than existing results. For example, existing convergence proofs when a local clock is used require that the measurement noise be an i.i.d. sequence. Here, it is assumed only that the measurement errors form a martingale difference sequence. Also, all results to date assume that the stochastic step sizes satisfy a probabilistic analog of the Robbins-Monro conditions. We replace this by a purely deterministic condition on the irreducibility of the underlying Markov processes. As specific applications to Reinforcement Learning, we introduce "batch" versions of the temporal difference algorithm $TD(0)$ for value iteration, and of the $Q$-learning algorithm for finding the optimal action-value function, and also permit the use of local clocks instead of a global clock. In all cases, we establish the convergence of these algorithms under milder conditions than in the existing literature.
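To make the distinction between batch updating and the two kinds of clock concrete, the following is a minimal illustrative sketch, not the paper's formulation. It assumes a toy function $f(\theta) = \theta^* - \theta$, i.i.d. Gaussian measurement noise (a special case of a martingale difference sequence), independent coin flips to choose which components are updated, and a step size of $1/k$. The names `basa`, `noisy_f`, `TARGET`, and `update_prob` are hypothetical and chosen only for this example.

```python
import random

# Hypothetical zero of the toy function f(theta) = TARGET - theta.
TARGET = [1.0, -2.0, 0.5]


def noisy_f(theta, rng):
    """Return a noisy measurement of f(theta) = TARGET - theta (assumed toy model)."""
    return [t_star - th + rng.gauss(0.0, 0.1) for t_star, th in zip(TARGET, theta)]


def basa(f_noisy, theta0, steps, update_prob=0.5, local_clock=True, seed=0):
    """Batch asynchronous stochastic approximation on a toy problem.

    At each time t, every component is updated independently with
    probability update_prob, so some but not necessarily all components
    change. With a local clock, the step size for component i depends on
    how many times component i has been updated; with a global clock,
    it depends on the time index t.
    """
    rng = random.Random(seed)
    theta = list(theta0)
    counts = [0] * len(theta)  # local clocks: number of updates per component
    for t in range(steps):
        y = f_noisy(theta, rng)  # noisy measurement at the current guess
        for i in range(len(theta)):
            if rng.random() < update_prob:  # component i is in this batch
                counts[i] += 1
                k = counts[i] if local_clock else t + 1
                theta[i] += y[i] / k  # assumed step size 1/k
    return theta


est_local = basa(noisy_f, [0.0, 0.0, 0.0], 20000, local_clock=True)
est_global = basa(noisy_f, [0.0, 0.0, 0.0], 20000, local_clock=False)
print("local clock: ", est_local)
print("global clock:", est_global)
```

In this sketch, setting `update_prob=1.0` recovers synchronous updating. The only difference between the two clock choices is the counter that drives the step size.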