Watkins and Dayan's Q-learning is a model-free reinforcement learning algorithm that iteratively refines an estimate of the optimal action-value function of an MDP by stochastically "visiting" many state-action pairs [Watkins and Dayan, 1992]. Variants of the algorithm lie at the heart of numerous recent state-of-the-art achievements in reinforcement learning, including the superhuman Atari-playing deep Q-network [Mnih et al., 2015]. The goal of this paper is to reproduce a precise and (nearly) self-contained proof that Q-learning converges. Much of the available literature leverages powerful theory to obtain highly generalizable results in this vein. However, this approach requires the reader to be familiar with, and draw deep connections between, several different research areas. A student seeking to deepen their understanding of Q-learning risks becoming caught in a vicious cycle of "RL-learning Hell". For this reason, we give a complete proof from start to finish using only one external result from the field of stochastic approximation, even though this minimal dependence on other results comes at the expense of some "shininess".
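For orientation, the tabular update at the heart of the algorithm can be sketched as follows (the notation used here, a step size $\alpha_t$, a discount factor $\gamma$, and an observed reward $r_t$, is chosen for illustration and is defined precisely in the sections that follow):
\[
Q_{t+1}(s_t, a_t) \;=\; (1 - \alpha_t)\, Q_t(s_t, a_t) \;+\; \alpha_t \Big( r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') \Big),
\]
with $Q_{t+1}(s, a) = Q_t(s, a)$ for every state-action pair not visited at time $t$.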