This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, all prior results suffer from a severe sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$. The current paper overcomes this barrier by certifying the minimax optimality of two algorithms -- a perturbed model-based algorithm and a conservative model-based algorithm -- as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). Moving beyond infinite-horizon MDPs, we further study time-inhomogeneous finite-horizon MDPs, and prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level. To the best of our knowledge, this work delivers the first minimax-optimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information theoretically infeasible).