Stochastic Approximation (SA) is a popular approach for solving fixed-point equations where the information is corrupted by noise. In this paper, we consider an SA algorithm involving a contraction mapping with respect to an arbitrary norm, and establish its finite-sample error bounds for different choices of stepsizes. The idea is to construct a smooth Lyapunov function using the generalized Moreau envelope, and show that the iterates of SA have negative drift with respect to that Lyapunov function. Our result is applicable in Reinforcement Learning (RL). In particular, we use it to establish the first-known convergence rate of the V-trace algorithm for off-policy TD-learning. Moreover, we also use it to study TD-learning in the on-policy setting and recover the existing state-of-the-art results for $Q$-learning. Importantly, our construction results in only a logarithmic dependence of the convergence bound on the size of the state space.
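To make the construction concrete, the display below sketches the SA recursion and the generalized Moreau envelope used as a Lyapunov function; the notation ($x_k$, $\epsilon_k$, $F$, $w_k$, $\|\cdot\|_c$, $\|\cdot\|_s$, $\mu$) and the schematic drift inequality are illustrative assumptions for this summary, not a verbatim statement of the paper's results.
\[
x_{k+1} = x_k + \epsilon_k\bigl(F(x_k) - x_k + w_k\bigr), \qquad \|F(x) - F(y)\|_c \le \gamma \|x - y\|_c, \quad \gamma \in (0,1),
\]
\[
M_\mu(x) \;=\; \min_{u \in \mathbb{R}^d} \Bigl\{ \tfrac{1}{2}\|u\|_c^2 + \tfrac{1}{2\mu}\|x - u\|_s^2 \Bigr\},
\]
where $\|\cdot\|_s$ is a smooth norm (e.g., an $\ell_p$ norm with $p \ge 2$) and $\mu > 0$ is a smoothing parameter. The envelope $M_\mu$ is smooth even when $\|\cdot\|_c$ is not (e.g., the $\ell_\infty$ norm), and the analysis hinges on a negative-drift inequality of roughly the form
\[
\mathbb{E}\bigl[M_\mu(x_{k+1} - x^\ast) \mid x_k\bigr] \;\le\; (1 - c_1 \epsilon_k)\, M_\mu(x_k - x^\ast) + c_2 \epsilon_k^2
\]
for constants $c_1, c_2 > 0$ depending on $\gamma$, $\mu$, and the noise, which then yields finite-sample bounds under the different stepsize choices.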