We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes. Compared to prior work, our bounds depend on alternative definitions of gaps. These definitions are based on the insight that, to achieve favorable regret, an algorithm does not need to learn how to behave optimally in states that are not reached by an optimal policy. We prove tighter regret upper bounds for optimistic algorithms and accompany them with new information-theoretic lower bounds for a large class of MDPs. Our results show that optimistic algorithms cannot achieve the information-theoretic lower bounds, even in deterministic MDPs, unless there is a unique optimal policy.
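For orientation, here is a minimal sketch of the conventional suboptimality gap that such bounds are usually stated in terms of; this is standard background using notation ($V_h^{*}$, $Q_h^{*}$, $\mathrm{gap}_h$, $T$) not introduced in the abstract, and is not the paper's own alternative definition:
\[
  \mathrm{gap}_h(s,a) \;=\; V_h^{*}(s) - Q_h^{*}(s,a),
\]
with gap-dependent regret bounds typically of the form
\[
  \mathrm{Regret}(T) \;=\; O\!\Bigg(\sum_{(s,a,h)\,:\,\mathrm{gap}_h(s,a)>0} \frac{\log T}{\mathrm{gap}_h(s,a)}\Bigg)
\]
up to horizon-dependent and constant factors. Roughly, the alternative definitions referred to above modify this notion so that states not reached by any optimal policy do not contribute to the bound.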