Reinforcement learning algorithms often require finiteness of the state and action spaces of Markov decision processes (MDPs), and various efforts have been made in the literature towards the applicability of such algorithms to continuous state and action spaces. In this paper, we show that under very mild regularity conditions (in particular, involving only weak continuity of the transition kernel of an MDP), Q-learning for standard Borel MDPs via quantization of states and actions converges to a limit, and furthermore this limit satisfies an optimality equation which leads to near optimality, either with explicit performance bounds or with guarantees of asymptotic optimality. Our approach builds on (i) viewing quantization as a measurement kernel and thus a quantized MDP as a partially observed MDP (POMDP), (ii) utilizing near optimality and convergence results of Q-learning for POMDPs, and (iii) finally, near optimality of finite state model approximations for MDPs with weakly continuous kernels, which we show to correspond to the fixed point of the constructed POMDP. Thus, our paper presents a very general convergence and approximation result for the applicability of Q-learning for continuous MDPs.
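As a rough illustration of the quantized procedure described above, the display below sketches one schematic form of a Q-learning iteration run on quantized states and actions; the symbols used here (the state quantizer \(q\), finite action set \(\mathsf{U}\), stage cost \(c\), discount factor \(\beta\), and step sizes \(\alpha_t\)) are not fixed in the abstract and are introduced only for this sketch.

\[
Q_{t+1}\big(q(x_t), u_t\big) \;=\; \big(1 - \alpha_t(q(x_t), u_t)\big)\, Q_t\big(q(x_t), u_t\big)
\;+\; \alpha_t(q(x_t), u_t)\Big[\, c(x_t, u_t) \;+\; \beta \min_{v \in \mathsf{U}} Q_t\big(q(x_{t+1}), v\big) \Big].
\]

Since the learner updates entries indexed by the quantization cell \(q(x_t)\) rather than the underlying state \(x_t\), such an iteration can be read as Q-learning on a POMDP whose measurement kernel is induced by the quantizer, which is the viewpoint referred to in item (i) above.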