批批次非同步的斯托口相近与应用强化学习的趋同 (Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement Learning)

The stochastic approximation (SA) algorithm is a widely used probabilistic method for finding a zero or a fixed point of a vector-valued funtion, when only noisy measurements of the function are available. In the literature to date, one makes a distinction between ``synchronous'' updating, whereby every component of the current guess is updated at each time, and ``asynchronous'' updating, whereby only one component is updated. In this paper, we study an intermediate situation that we call ``batch asynchronous stochastic approximation'' (BASA), in which, at each time instant, \textit{some but not all} components of the current estimated solution are updated. BASA allows the user to trade off memory requirements against time complexity. We develop a general methodology for proving that such algorithms converge to the fixed point of the map under study. These convergence proofs make use of weaker hypotheses than existing results. Specifically, existing convergence proofs require that the measurement noise is a zero-mean i.i.d\ sequence or a martingale difference sequence. In the present paper, we permit biased measurements, that is, measurement noises that have nonzero conditional mean. Also, all convergence results to date assume that the stochastic step sizes satisfy a probabilistic analog of the well-known Robbins-Monro conditions. We replace this assumption by a purely deterministic condition on the irreducibility of the underlying Markov processes. As specific applications to Reinforcement Learning, we analyze the temporal difference algorithm $TD(\lambda)$ for value iteration, and the $Q$-learning algorithm for finding the optimal action-value function. In both cases, we establish the convergence of these algorithms, under milder conditions than in the existing literature.

翻译：随机近似( SSA) 算法是一种广泛使用的概率方法, 用于寻找一个零或固定点的矢量值调味值( BaSA), 而此时只有对函数的杂音测量。在迄今为止的文献中, 人们可以区分“ 同步性” 更新, 即当前猜测的每个部分每次更新, 以及“ 同步性” 更新, 也就是仅更新一个组件。在本文中, 我们研究一种中间情况, 我们称之为“ 超常性超常性调调值“ (BASA ), 其中每次更新当前估计解决方案中的特定值应用(textit{some but not all} 元值的组件。在目前的文献中, BSA 允许用户根据时间复杂性来交换记忆要求。我们开发了一个一般方法, 证明这种算法与所研究的地图的固定点一致。这些趋同证明比现有结果更弱。具体地说, 现有的趋同性证据要求测量噪音是零- i. d\ 序列, 或马德性差值。在目前估算值的测算序列中,, 我们的测算结果。