The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $\epsilon$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $\epsilon$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show that this is not possible -- there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity -- yielding a sample complexity which scales with the suboptimality gaps and the ``reachability'' of a state. We show that our algorithm is nearly minimax optimal, and demonstrate on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.
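For concreteness, the ``simple reduction'' mentioned above can be sketched via the standard online-to-batch argument. The regret bound $\widetilde{O}(\sqrt{CK})$ below is a generic placeholder, with $C$ absorbing horizon and state-action factors; it is not a bound claimed in this work.

\begin{align*}
  \text{Regret over } K \text{ episodes:} \quad
  \mathcal{R}_K &= \sum_{k=1}^{K} \big( V^\star - V^{\pi_k} \big)
    \le \widetilde{O}\big(\sqrt{C K}\big), \\
  \text{output } \widehat{\pi} \sim \mathrm{Unif}\{\pi_1,\dots,\pi_K\}
    \;\Longrightarrow\;
  \mathbb{E}\big[ V^\star - V^{\widehat{\pi}} \big]
    &= \frac{\mathcal{R}_K}{K}
    \le \widetilde{O}\Big(\sqrt{\tfrac{C}{K}}\Big), \\
  \text{so } K &= \widetilde{O}\big( C/\epsilon^2 \big)
    \text{ episodes suffice for an } \epsilon\text{-optimal policy (in expectation).}
\end{align*}

This conversion inherits only the worst-case rate of the regret bound, which is what motivates asking whether low-regret algorithms can also achieve instance-optimal policy identification.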