In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an $\epsilon$-optimal policy with probability $1-\delta$. While minimax-optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we establish the first nearly matching (up to a horizon-squared factor and logarithmic terms) upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts (minimum flows) and a new maximum-coverage exploration strategy.
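For concreteness, the $(\epsilon,\delta)$-PAC criterion referred to above can be written in the standard episodic-MDP notation, which is assumed here rather than fixed by this abstract: $s_1$ denotes the initial state, $V_1^{\pi}$ the value of a policy $\pi$ from the first step, $V_1^{\star}$ the optimal value, and $\hat{\pi}$ the policy returned by the algorithm when it stops.
\[
  % (epsilon, delta)-PAC criterion: the returned policy \hat{\pi} must be
  % epsilon-optimal with probability at least 1 - delta.
  \mathbb{P}\Big( V_1^{\star}(s_1) - V_1^{\hat{\pi}}(s_1) \le \epsilon \Big) \;\ge\; 1 - \delta .
\]
The sample complexity studied in the paper is then the number of episodes the algorithm collects before stopping and outputting $\hat{\pi}$.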