In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an $\epsilon$-optimal policy with probability $1-\delta$. While minimax optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we establish the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts, which we believe shed new light on this problem.
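For intuition, here is a minimal sketch of one natural way to formalize a return gap in a deterministic episodic MDP; the notation (horizon $H$, initial state $s_1$, optimal value $V_1^{\star}$) and this particular formula are illustrative assumptions, not necessarily the paper's formal definition of the deterministic return gap. For a state-action pair $(s,a)$ at stage $h \in \{1,\dots,H\}$, one may consider $\bar{\Delta}_h(s,a) = V_1^{\star}(s_1) - \max_{\pi\,:\,(s_h^{\pi},\,a_h^{\pi})=(s,a)} V_1^{\pi}(s_1)$, i.e., the return lost by the best policy constrained to traverse $(s,a)$ at stage $h$, where $s_h^{\pi}$ and $a_h^{\pi}$ denote the (deterministic) state and action visited by $\pi$ at stage $h$. Such a quantity is well defined precisely because transitions are deterministic, so each policy induces a single trajectory.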