The stochastic approximation (SA) algorithm is a widely used probabilistic method for finding a zero or a fixed point of a vector-valued function when only noisy measurements of the function are available. In the literature to date, one makes a distinction between ``synchronous'' updating, whereby every component of the current guess is updated at each time instant, and ``asynchronous'' updating, whereby only one component is updated. In this paper, we study an intermediate situation that we call ``batch asynchronous stochastic approximation'' (BASA), in which, at each time instant, \textit{some but not all} components of the current estimated solution are updated. BASA allows the user to trade off memory requirements against time complexity. We develop a general methodology for proving that such algorithms converge to the fixed point of the map under study. These convergence proofs rest on weaker hypotheses than existing results. Specifically, existing convergence proofs require that the measurement noise be a zero-mean i.i.d.\ sequence or a martingale difference sequence. In the present paper, we permit biased measurements, that is, measurement noises with nonzero conditional mean. Also, all convergence results to date assume that the stochastic step sizes satisfy a probabilistic analog of the well-known Robbins--Monro conditions. We replace this assumption by a purely deterministic condition on the irreducibility of the underlying Markov processes. As specific applications to reinforcement learning, we analyze the temporal difference algorithm $TD(\lambda)$ for value iteration, and the $Q$-learning algorithm for finding the optimal action-value function. In both cases, we establish the convergence of these algorithms under milder conditions than in the existing literature.
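For concreteness, a generic BASA iteration can be sketched as follows; the notation ($\boldsymbol{\theta}_t$, $S_t$, $\alpha_t$, $\xi_{t+1}$) is introduced here purely for illustration and is not taken from the results themselves. Let $\boldsymbol{\theta}_t \in \mathbb{R}^d$ denote the current estimate of a fixed point of $\mathbf{f} \colon \mathbb{R}^d \to \mathbb{R}^d$, and let $S_t \subseteq \{1, \dots, d\}$ be the set of components updated at time $t$. Then
\[
\theta_{t+1,i} =
\begin{cases}
\theta_{t,i} + \alpha_t \bigl( f_i(\boldsymbol{\theta}_t) + \xi_{t+1,i} - \theta_{t,i} \bigr), & i \in S_t, \\
\theta_{t,i}, & i \notin S_t,
\end{cases}
\]
where $\xi_{t+1}$ is the measurement noise, permitted here to have nonzero conditional mean. Synchronous updating corresponds to $S_t = \{1, \dots, d\}$ at every $t$, fully asynchronous updating to $|S_t| = 1$, and BASA to the intermediate regime $1 \le |S_t| \le d$.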