Offline or batch reinforcement learning seeks to learn a near-optimal policy using historical data, without active exploration of the environment. To counter the insufficient coverage and sample scarcity of many offline datasets, the principle of pessimism has recently been introduced to mitigate the high bias of the estimated values. While pessimistic variants of model-based algorithms (e.g., value iteration with lower confidence bounds) have been theoretically investigated, their model-free counterparts -- which do not require explicit model estimation -- have not been adequately studied, especially in terms of sample efficiency. To address this inadequacy, we study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes, and characterize its sample complexity under the single-policy concentrability assumption, which does not require full coverage of the state-action space. In addition, a variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity. Altogether, this work highlights the efficiency of model-free algorithms in offline RL when used in conjunction with pessimism and variance reduction.
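To make the pessimism principle concrete, the following is a minimal, illustrative Python sketch of Q-learning with a lower-confidence-bound penalty on an offline dataset in a finite-horizon tabular MDP. It is not the paper's exact algorithm: the function name `pessimistic_q_learning`, the penalty constant `gamma_scale`, the failure probability `delta`, the backward-in-horizon replay order, and the specific bonus form are all illustrative assumptions; only the overall structure (subtracting a count-dependent bonus from the bootstrapped target) reflects the pessimistic Q-learning idea described above.

```python
import numpy as np

def pessimistic_q_learning(dataset, S, A, H, gamma_scale=1.0, delta=0.01):
    """Illustrative sketch of LCB-penalized Q-learning on an offline dataset.

    dataset: list of trajectories, each a list of (h, s, a, r, s_next) tuples
    S, A, H: number of states, number of actions, and horizon length
    gamma_scale, delta: placeholder penalty constant and failure probability,
        not the constants derived in the paper.
    Rewards are assumed to lie in [0, 1].
    """
    Q = np.zeros((H + 1, S, A))        # Q[H] stays zero (terminal step)
    V = np.zeros((H + 1, S))
    counts = np.zeros((H, S, A), dtype=int)

    # Replay the offline data backward in the horizon so the bootstrapped
    # value V[h + 1] has already been updated when step h is processed.
    for h in reversed(range(H)):
        for traj in dataset:
            for (step, s, a, r, s_next) in traj:
                if step != h:
                    continue
                counts[h, s, a] += 1
                n = counts[h, s, a]
                eta = (H + 1) / (H + n)   # rescaled linear step size
                # Lower-confidence-bound penalty shrinking with the visit count.
                bonus = gamma_scale * np.sqrt(
                    H**2 * np.log(S * A * H / delta) / n
                )
                target = r + V[h + 1, s_next] - bonus
                Q[h, s, a] = (1 - eta) * Q[h, s, a] + eta * target
        # Keep value estimates within their valid range at step h.
        V[h] = np.clip(Q[h].max(axis=1), 0.0, H - h)

    policy = Q[:H].argmax(axis=2)       # greedy policy at each step
    return Q, V, policy
```

The key design choice the sketch tries to convey is that the penalty decreases with the visit count of each state-action pair, so actions poorly covered by the offline data are assigned pessimistic values and the learned policy avoids them; the variance-reduced variant mentioned in the abstract refines the bootstrapped target but is not shown here.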