We consider the problem of model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path of the system is available. We focus on the classical approach of Q-learning, where the goal is to learn the optimal Q-function. We propose the Nearest Neighbor Q-Learning approach, which utilizes a nearest neighbor regression method to learn the Q-function. We provide a finite-sample analysis of the convergence rate of this method. In particular, we establish that the algorithm is guaranteed to output an $\epsilon$-accurate estimate of the optimal Q-function with high probability using a number of observations that depends polynomially on $\epsilon$ and the model parameters. To establish our results, we develop a robust version of stochastic approximation results; this may be of interest in its own right.
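For concreteness, the following is a schematic sketch of a nearest-neighbor Q-learning update, assuming Q-values are maintained at a finite set of anchor states $\{x_i\}$ covering the state space and interpolated with nearest-neighbor weights $w_i(\cdot)$; the notation is illustrative and not necessarily the paper's exact construction:
\[
\widehat{Q}_t(s,a) \;=\; \sum_{i} w_i(s)\, q_t(x_i, a), \qquad
q_{t+1}(x_i, a_t) \;=\; (1-\alpha_t)\, q_t(x_i, a_t) \;+\; \alpha_t \Bigl( r_t + \gamma \max_{a'} \widehat{Q}_t(s_{t+1}, a') \Bigr),
\]
where the update is applied to anchor points $x_i$ near the observed state $s_t$, $\alpha_t$ is a step size, and $\gamma$ is the discount factor.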