Offline reinforcement learning (RL) concerns learning an optimal policy for sequential decision-making from a pre-collected dataset, without further interaction with the environment. Recent theoretical progress has focused on developing sample-efficient offline RL algorithms under various relaxed assumptions on data coverage and function approximators, especially for handling excessively large state-action spaces. Among these, the framework based on the linear-programming (LP) reformulation of Markov decision processes has shown promise: it enables sample-efficient offline RL with function approximation under only partial data coverage and realizability assumptions on the function classes, while retaining favorable computational tractability. In this work, we revisit the LP framework for offline RL and advance the existing results in several aspects, relaxing certain assumptions and achieving optimal statistical rates in terms of sample size. Our key enabler is to introduce proper constraints in the reformulation, instead of using any regularization as in the literature, sometimes also combined with careful choices of the function classes and initial state distributions. We hope our insights further advocate for the study of the LP framework, as well as the induced primal-dual minimax optimization, in offline RL.
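As background for the LP reformulation and the induced primal-dual minimax problem referred to above, the standard (dual) linear program over discounted occupancy measures takes roughly the following textbook form; this is a generic sketch, not necessarily the exact constrained variant studied in this work, and the symbols $\gamma$ (discount factor), $\mu_0$ (initial state distribution), $P$ (transition kernel), $r$ (reward), $d$ (occupancy measure), and $v$ (value function) are our notation:

\[
\max_{d \ge 0} \ \sum_{s,a} d(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{a} d(s,a) = (1-\gamma)\,\mu_0(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s .
\]

Dualizing the flow constraints with multipliers $v(\cdot)$ yields the primal-dual minimax optimization

\[
\min_{v} \ \max_{d \ge 0} \ (1-\gamma)\, \mathbb{E}_{s \sim \mu_0}\!\left[v(s)\right] + \sum_{s,a} d(s,a)\Big( r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[v(s')\right] - v(s) \Big),
\]

where, in the offline setting with function approximation, $d$ and $v$ are restricted to (realizable) function classes and the expectations are estimated from the pre-collected dataset.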