Various algorithms for reinforcement learning (RL) exhibit dramatic variation in their convergence rates as a function of problem structure. Such problem-dependent behavior is not captured by worst-case analyses, and it has accordingly inspired a growing effort to obtain instance-dependent guarantees and to derive instance-optimal algorithms for RL problems. This research has been carried out, however, primarily within the confines of theory, providing guarantees that explain \textit{ex post} the performance differences observed. A natural next step is to convert these theoretical guarantees into guidelines that are useful in practice. We address the problem of obtaining sharp instance-dependent confidence regions for the policy evaluation and optimal value estimation problems in a Markov decision process (MDP), given access to an instance-optimal algorithm. As a consequence, we propose a data-dependent stopping rule for instance-optimal algorithms. The proposed stopping rule adapts to the instance-specific difficulty of the problem and allows early termination on problems with favorable structure.
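To fix intuition, a minimal sketch of such a data-dependent stopping rule can be stated as follows (the symbols $\widehat{w}_t(\delta)$, $\epsilon$, and $\tau$ are illustrative notation, not taken from the paper): given an anytime instance-dependent confidence region whose half-width after $t$ samples is $\widehat{w}_t(\delta)$, one stops at
\[
\tau(\epsilon,\delta) \;=\; \inf\bigl\{\, t \ge 1 \;:\; \widehat{w}_t(\delta) \le \epsilon \,\bigr\},
\]
so that, on the event that the confidence region is valid, the returned estimate is $\epsilon$-accurate with probability at least $1-\delta$, while $\tau$ is small on instances whose structure makes $\widehat{w}_t(\delta)$ shrink quickly.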