Offline reinforcement learning (offline RL) considers problems where learning is performed using only previously collected samples and is helpful in settings where collecting new data is costly or risky. In model-based offline RL, the learner performs estimation (or optimization) using a model constructed from the empirical transition frequencies. We analyze the sample complexity of vanilla model-based offline RL with dependent samples in the infinite-horizon discounted-reward setting. In our setting, the samples obey the dynamics of the Markov decision process and, consequently, may have interdependencies. Without assuming independent samples, we provide a high-probability, polynomial sample complexity bound for vanilla model-based off-policy evaluation that requires partial or uniform coverage. We extend this result to off-policy optimization under uniform coverage. As a comparison to the model-based approach, we analyze the sample complexity of off-policy evaluation with vanilla importance sampling in the infinite-horizon setting. Finally, we provide an estimator that outperforms the sample-mean estimator for the almost-deterministic dynamics that are prevalent in reinforcement learning.
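To make the two estimators being compared concrete, the following is a minimal sketch, not the paper's implementation, of (i) vanilla model-based off-policy evaluation on a tabular MDP, where the empirical transition frequencies define a model MDP in which the target policy is evaluated exactly, and (ii) a truncated-horizon vanilla (trajectory-wise) importance sampling estimate. All names (`empirical_model`, `model_based_ope`, `vanilla_is_ope`, `target_policy`, `behavior_policy`) and the uniform fallback for unvisited state-action pairs are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def empirical_model(trajectory, n_states, n_actions):
    """Maximum-likelihood transition model from a single trajectory of (s, a, s') triples.

    Consecutive triples come from one Markov chain, so the samples are dependent.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in trajectory:
        counts[s, a, s_next] += 1
    visits = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform distribution (an assumption of this sketch).
    P_hat = np.divide(counts, visits,
                      out=np.full_like(counts, 1.0 / n_states),
                      where=visits > 0)
    return P_hat

def model_based_ope(P_hat, reward, target_policy, gamma):
    """Evaluate target_policy in the empirical MDP by solving the Bellman equation."""
    n_states = P_hat.shape[0]
    # State-to-state transition matrix and reward vector induced by the target policy.
    P_pi = np.einsum('sa,sap->sp', target_policy, P_hat)
    r_pi = np.einsum('sa,sa->s', target_policy, reward)
    # V = r_pi + gamma * P_pi @ V  =>  V = (I - gamma * P_pi)^{-1} r_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def vanilla_is_ope(trajectories, behavior_policy, target_policy, gamma, horizon):
    """Trajectory-wise importance sampling, truncated at `horizon` to approximate
    the infinite-horizon discounted return (the truncation is an assumption here)."""
    estimates = []
    for traj in trajectories:                 # each traj: list of (s, a, r) tuples
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj[:horizon]):
            weight *= target_policy[s, a] / behavior_policy[s, a]
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```

In this sketch the model-based estimate reuses every transition in the batch through the plug-in model, whereas the importance sampling estimate reweights whole trajectories by the likelihood ratio of the target and behavior policies, which is one way to view the comparison drawn in the abstract.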