Modern reinforcement learning (RL) often faces an enormous state-action space. Existing analytical results typically apply to settings with a small number of state-action pairs, or to simple models such as linearly modeled Q-functions. To derive statistically efficient RL policies that handle large state-action spaces with more general Q-functions, some recent works have considered nonlinear function approximation using kernel ridge regression. In this work, we derive sample complexities for kernel-based Q-learning when a generative model is available. We propose a nonparametric Q-learning algorithm which finds an $\epsilon$-optimal policy in an arbitrarily large-scale discounted MDP. The sample complexity of the proposed algorithm is order-optimal with respect to $\epsilon$ and the complexity of the kernel (in terms of its information gain). To the best of our knowledge, this is the first result showing a finite sample complexity under such a general model.
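To illustrate the kind of nonlinear function approximation the abstract refers to, below is a minimal, generic sketch of one step of fitted Q-iteration in which the Q-function is estimated by kernel ridge regression over state-action pairs drawn from a generative model. This is not the paper's algorithm: the RBF kernel, the toy MDP, and the helper functions `rbf_kernel` and `kernel_ridge_q_update` are assumptions introduced purely for illustration.

```python
# Generic sketch (assumed, not the paper's algorithm): one fitted-Q-iteration
# step with a kernel ridge regression estimate of Q over state-action features.
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """RBF (Gaussian) kernel matrix between rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def kernel_ridge_q_update(Z, targets, reg=1e-2):
    """Fit Q on state-action features Z against Bellman targets.

    Returns a callable q_hat(Z_query) giving kernel ridge predictions.
    """
    K = rbf_kernel(Z, Z)
    alpha = np.linalg.solve(K + reg * np.eye(len(Z)), targets)
    return lambda Zq: rbf_kernel(Zq, Z) @ alpha

# Toy usage: states in [0, 1], two actions; the generative model returns
# a reward and next state for any queried (s, a). All dynamics are assumed.
rng = np.random.default_rng(0)
gamma = 0.9
states = rng.uniform(0.0, 1.0, size=200)
actions = rng.integers(0, 2, size=200).astype(float)
rewards = np.cos(3.0 * states) + 0.1 * actions                 # assumed reward
next_states = np.clip(states + 0.1 * (2.0 * actions - 1.0), 0.0, 1.0)

Z = np.stack([states, actions], axis=1)                        # (s, a) features
q_prev = lambda Zq: np.zeros(len(Zq))                          # Q_0 = 0
# Bellman target: r + gamma * max_{a'} Q_prev(s', a')
next_q = np.maximum(
    q_prev(np.stack([next_states, np.zeros_like(next_states)], axis=1)),
    q_prev(np.stack([next_states, np.ones_like(next_states)], axis=1)),
)
targets = rewards + gamma * next_q
q_hat = kernel_ridge_q_update(Z, targets)
print(q_hat(np.array([[0.5, 0.0], [0.5, 1.0]])))               # Q estimates at s = 0.5
```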