We study reinforcement learning (RL) with linear function approximation. Existing algorithms for this problem have only high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees, which cannot guarantee convergence to the optimal policy. To overcome this limitation, we propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability. The uniform-PAC guarantee is the strongest known guarantee for reinforcement learning in the literature: it directly implies both a PAC sample complexity bound and a high-probability regret bound, making our algorithm superior to all existing algorithms with linear function approximation. At the core of our algorithm are a novel minimax value function estimator and a multi-level partition scheme that selects training samples from historical observations. Both of these techniques are new and of independent interest.
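For reference, the following is a minimal sketch of the uniform-PAC criterion under assumed notation ($\Delta_k$, $\epsilon$, $\delta$ are introduced here for illustration only and are not defined in the abstract above). Writing $\Delta_k = V^*(s_1^k) - V^{\pi_k}(s_1^k)$ for the suboptimality of the policy $\pi_k$ executed in episode $k$, an algorithm is uniform-PAC at confidence level $\delta$ if, with probability at least $1-\delta$,
\[
  \forall \epsilon > 0:\quad
  \bigl|\{\, k \in \mathbb{N} : \Delta_k > \epsilon \,\}\bigr|
  \;\le\; \mathrm{poly}\!\left(\tfrac{1}{\epsilon},\, \log\tfrac{1}{\delta}\right),
\]
where the polynomial may also depend on problem parameters such as the feature dimension and the episode horizon. Because a single probability-$(1-\delta)$ event controls every accuracy level $\epsilon$ simultaneously, this criterion yields both a PAC sample complexity bound and a high-probability regret bound.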