This paper is concerned with offline reinforcement learning (RL), which learns from pre-collected data without further exploration. Effective offline RL must accommodate distribution shift and limited data coverage. However, prior algorithms and analyses either suffer from suboptimal sample complexities or incur high burn-in costs before reaching sample optimality, posing an impediment to efficient offline RL in sample-starved applications. We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost for tabular Markov decision processes (MDPs). Concretely, consider a finite-horizon (resp. $\gamma$-discounted infinite-horizon) MDP with $S$ states and horizon $H$ (resp. effective horizon $\frac{1}{1-\gamma}$), and suppose the distribution shift of the data is reflected by some single-policy clipped concentrability coefficient $C^{\star}_{\text{clipped}}$. We prove that model-based offline RL yields $\varepsilon$-accuracy with a sample complexity of \[ \begin{cases} \frac{H^{4}SC^{\star}_{\text{clipped}}}{\varepsilon^{2}} & (\text{finite-horizon MDPs}) \\ \frac{SC^{\star}_{\text{clipped}}}{(1-\gamma)^{3}\varepsilon^{2}} & (\text{infinite-horizon MDPs}) \end{cases} \] up to log factors, which is minimax optimal for the entire $\varepsilon$-range. The proposed algorithms are ``pessimistic'' variants of value iteration with Bernstein-style penalties, and do not require sophisticated variance-reduction schemes. Our analysis framework is established upon delicate leave-one-out decoupling arguments in conjunction with careful self-bounding techniques tailored to MDPs.
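For intuition, a pessimistic variant of value iteration with a Bernstein-style penalty can be sketched as below for the finite-horizon tabular setting. This is a minimal illustrative sketch, not the paper's exact algorithm: the function name, the penalty constant `c`, the confidence level `delta`, and the precise form of the bonus are assumptions chosen for readability. The key ingredients match the abstract: empirical transitions estimated from offline data (the "plug-in" model), and a variance-aware lower-confidence penalty subtracted from the Q-values so that poorly covered state-action pairs are treated pessimistically.

```python
import numpy as np

def pessimistic_value_iteration(counts, rewards, H, c=1.0, delta=0.1):
    """Illustrative pessimistic value iteration with a Bernstein-style penalty.

    counts[h, s, a, s'] : empirical transition counts from the offline dataset
    rewards[h, s, a]    : known deterministic rewards in [0, 1]
    Returns pessimistic value estimates V[h, s] and a greedy policy pi[h, s].
    """
    _, S, A, _ = counts.shape
    V = np.zeros((H + 1, S))            # V[H] = 0 (terminal)
    pi = np.zeros((H, S), dtype=int)
    log_term = np.log(max(2.0, H * S * A / delta))
    for h in range(H - 1, -1, -1):      # backward induction over the horizon
        Q = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                n = counts[h, s, a].sum()
                if n == 0:
                    Q[s, a] = 0.0       # no coverage: maximally pessimistic
                    continue
                p_hat = counts[h, s, a] / n            # plug-in transition model
                mean = p_hat @ V[h + 1]
                var = p_hat @ V[h + 1] ** 2 - mean ** 2  # empirical variance of V
                # Bernstein-style penalty: variance term + higher-order 1/n term
                b = c * (np.sqrt(var * log_term / n) + H * log_term / n)
                Q[s, a] = max(0.0, rewards[h, s, a] + mean - b)
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return V, pi
```

The penalty combines a $\sqrt{\mathrm{Var}/n}$ term with a lower-order $H/n$ term, mirroring Bernstein's inequality; this variance awareness is what removes the burn-in cost without resorting to variance-reduction machinery.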