Thompson Sampling is one of the most effective methods for contextual bandits and has been generalized to posterior sampling for certain MDP settings. However, existing posterior sampling methods for reinforcement learning are limited by being model-based or by lacking worst-case theoretical guarantees beyond linear MDPs. This paper proposes a new model-free formulation of posterior sampling that applies to more general episodic reinforcement learning problems with theoretical guarantees. We introduce novel proof techniques to show that, under suitable conditions, the worst-case regret of our posterior sampling method matches the best known results of optimization-based methods. In the linear MDP setting, the regret of our algorithm scales linearly with the dimension, compared to the quadratic dependence of existing posterior sampling-based exploration algorithms.
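As an illustration of the posterior sampling idea referenced above, the following is a minimal sketch of Thompson Sampling for a linear contextual bandit; it is not the paper's model-free algorithm for episodic RL. The Gaussian reward model, Gaussian prior, class name `LinearThompsonSampling`, and parameters such as `prior_var` and `noise_var` are assumptions made for this example.

```python
import numpy as np

# Minimal sketch of Thompson Sampling for a linear contextual bandit.
# Illustrative assumptions: Gaussian rewards with known noise variance
# `noise_var`, a Gaussian prior N(0, prior_var * I) on each arm's reward
# parameter, and d-dimensional context features.
class LinearThompsonSampling:
    def __init__(self, n_arms, dim, prior_var=1.0, noise_var=1.0):
        self.noise_var = noise_var
        # Per-arm posterior N(mean, cov), maintained via its precision
        # matrix `cov_inv` and the vector b = sum_t reward_t * context_t.
        self.cov_inv = [np.eye(dim) / prior_var for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select_arm(self, context):
        # Sample one parameter vector per arm from its posterior and play
        # the arm whose sampled model predicts the highest reward.
        sampled_rewards = []
        for cov_inv, b in zip(self.cov_inv, self.b):
            cov = np.linalg.inv(cov_inv)
            mean = cov @ b / self.noise_var
            theta = np.random.multivariate_normal(mean, cov)
            sampled_rewards.append(context @ theta)
        return int(np.argmax(sampled_rewards))

    def update(self, arm, context, reward):
        # Standard Bayesian linear-regression posterior update.
        self.cov_inv[arm] += np.outer(context, context) / self.noise_var
        self.b[arm] += reward * context


# Toy usage: 3 arms, 5-dimensional contexts, synthetic linear rewards.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=(3, 5))
agent = LinearThompsonSampling(n_arms=3, dim=5)
for t in range(1000):
    x = rng.normal(size=5)
    a = agent.select_arm(x)
    r = x @ true_theta[a] + rng.normal(scale=0.1)
    agent.update(a, x, r)
```

The design choice here is the one Thompson Sampling is known for: exploration comes entirely from sampling the parameters from the posterior rather than from an explicit optimism bonus, which is also the contrast with optimization-based methods drawn in the abstract.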