We propose a general framework for designing posterior sampling methods for model-based RL. We show that the proposed algorithms can be analyzed by reducing regret to the Hellinger distance in conditional probability estimation. We further show that optimistic posterior sampling can control this Hellinger distance when model error is measured via data likelihood. This technique allows us to design and analyze unified posterior sampling algorithms with state-of-the-art sample complexity guarantees for a wide range of model-based RL settings. We instantiate our general result in several special cases, demonstrating the versatility of our framework.