This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, all prior results suffer from a severe sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$. The current paper overcomes this barrier by certifying the minimax optimality of two algorithms, a perturbed model-based algorithm and a conservative model-based algorithm, as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (up to a logarithmic factor). Moving beyond infinite-horizon MDPs, we further study time-inhomogeneous finite-horizon MDPs, and prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity for any target accuracy level. To the best of our knowledge, this work delivers the first minimax-optimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information-theoretically infeasible).
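To make the model-based approach concrete, the following is a minimal Python sketch of planning with a generative model: draw a fixed number of next-state samples for every state-action pair, form the empirical MDP, and plan in it by value iteration. The abstract does not spell out the perturbation scheme, so the small uniform reward perturbation used here (with hypothetical size `xi`) is an assumption made for illustration, as are all function names and the choice of value iteration as the planner. The conservative variant is not sketched.

```python
import numpy as np


def sample_empirical_model(P, n_samples, rng):
    """Query the simulator n_samples times per (s, a) and return the
    empirical transition kernel P_hat with shape (S, A, S).

    P plays the role of the generative model here; in practice it is
    unknown and only accessible through sampling.
    """
    S, A, _ = P.shape
    P_hat = np.zeros((S, A, S), dtype=float)
    for s in range(S):
        for a in range(A):
            next_states = rng.choice(S, size=n_samples, p=P[s, a])
            P_hat[s, a] = np.bincount(next_states, minlength=S) / n_samples
    return P_hat


def value_iteration(P, r, gamma, tol=1e-10, max_iter=100_000):
    """Plan in a known MDP (P, r); return a greedy policy and its value estimate."""
    S = P.shape[0]
    V = np.zeros(S)
    for _ in range(max_iter):
        Q = r + gamma * (P @ V)  # (S, A, S) @ (S,) -> (S, A)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = r + gamma * (P @ V)
    return Q.argmax(axis=1), V


def perturbed_model_based_policy(P, r, gamma, n_samples, xi, rng):
    """Assumed perturbation scheme: add independent Uniform(0, xi) noise to
    each reward, then plan in the empirical MDP with the perturbed rewards."""
    P_hat = sample_empirical_model(P, n_samples, rng)
    r_perturbed = r + rng.uniform(0.0, xi, size=r.shape)
    policy, _ = value_iteration(P_hat, r_perturbed, gamma)
    return policy
```

In this sketch the total number of simulator calls is $|\mathcal{S}||\mathcal{A}|$ times `n_samples`, which is the sample size that the bounds in the abstract refer to.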