Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite recent efforts to relax these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other intact. As an important open problem, can we achieve sample-efficient offline RL with weak assumptions on both factors? In this paper we answer the question in the positive. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a density-ratio function against offline data. With proper regularization, we show that the algorithm enjoys polynomial sample complexity under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.
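For intuition, a minimal sketch of the primal-dual objective in illustrative notation not taken verbatim from the paper: writing the discounted-occupancy linear program of the MDP in Lagrangian form and re-parameterizing the occupancy $d$ by the density ratio $w(s,a) = d(s,a)/d^D(s,a)$ against the offline data distribution $d^D$ gives
\[
  L(w, v) \;=\; (1-\gamma)\,\mathbb{E}_{s \sim \mu_0}\big[v(s)\big]
  \;+\; \mathbb{E}_{(s,a,r,s') \sim d^D}\Big[\, w(s,a)\,\big(r + \gamma\, v(s') - v(s)\big) \Big],
\]
where $\mu_0$ is the initial-state distribution and $v$ the value-function (primal) variable. The algorithm sketched here solves a regularized saddle point
\[
  \max_{w \in \mathcal{W}} \; \min_{v \in \mathcal{V}} \;\; L(w, v) \;-\; \alpha\,\mathbb{E}_{(s,a) \sim d^D}\big[f\big(w(s,a)\big)\big],
\]
over realizable classes $\mathcal{W}$ and $\mathcal{V}$; the regularizer $f$ and weight $\alpha$ stand in for the paper's specific regularization scheme.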