Offline reinforcement learning (RL), which refers to decision-making from a previously collected dataset of interactions, has received significant attention in recent years. Much effort has focused on improving the practicality of offline RL by addressing the prevalent issue of partial data coverage through various forms of conservative policy learning. While the majority of algorithms do not have finite-sample guarantees, several provable conservative offline RL algorithms have been designed and analyzed within the single-policy concentrability framework that handles partial coverage. Yet, in the nonlinear function approximation setting, where confidence intervals are difficult to obtain, existing provable algorithms suffer from computational intractability, prohibitively strong assumptions, and suboptimal statistical rates. In this paper, we leverage the marginalized importance sampling (MIS) formulation of RL and present the first set of offline RL algorithms that are statistically optimal and practical under general function approximation and single-policy concentrability, bypassing the need for uncertainty quantification. We identify that the key to successfully solving the sample-based approximation of the MIS problem is ensuring that certain occupancy validity constraints are nearly satisfied. We enforce these constraints by a novel application of the augmented Lagrangian method and prove the following result: with the MIS formulation, augmented Lagrangian is enough for statistically optimal offline RL. In stark contrast to prior algorithms that induce additional conservatism through methods such as behavior regularization, our approach provably eliminates this need and reinterprets regularizers as "enforcers of occupancy validity" rather than "promoters of conservatism."
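As a point of reference, the sketch below recalls the standard occupancy-measure linear program underlying the MIS formulation and a generic augmented Lagrangian for its flow constraints; the notation ($d$, $w$, $\mu_0$, $\mu$, $v$, $\rho$, $e_w$) is introduced here for illustration and is not taken from the paper's specific algorithm.

\[
\max_{d \ge 0}\; \sum_{s,a} d(s,a)\, r(s,a)
\quad \text{s.t.}\quad
\sum_{a} d(s,a) \;=\; (1-\gamma)\,\mu_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a')
\quad \forall s,
\]
where $d$ is the discounted state-action occupancy, $\mu_0$ the initial distribution, and $\gamma$ the discount factor. Reparameterizing with marginalized importance weights $w(s,a) = d(s,a)/\mu(s,a)$ against the data distribution $\mu$ turns the objective into $\mathbb{E}_{\mu}[w(s,a)\, r(s,a)]$, which can be estimated from offline samples. Writing the flow-constraint violation as
\[
e_w(s) \;=\; (1-\gamma)\,\mu_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\,\mu(s',a')\,w(s',a') \;-\; \sum_{a} \mu(s,a)\,w(s,a),
\]
a generic augmented Lagrangian with multipliers $v$ and penalty parameter $\rho > 0$ takes the form
\[
\mathcal{L}_\rho(w, v) \;=\; \mathbb{E}_{\mu}\!\left[w(s,a)\, r(s,a)\right]
\;+\; \sum_{s} v(s)\, e_w(s)
\;-\; \frac{\rho}{2} \sum_{s} e_w(s)^2 .
\]
The quadratic penalty term is what pushes the learned weights toward occupancy validity ($e_w \approx 0$); this is the sense in which the abstract describes regularizers as enforcers of occupancy validity rather than promoters of conservatism.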