We consider a challenging theoretical problem in offline reinforcement learning (RL): obtaining sample-efficiency guarantees with a dataset lacking sufficient coverage, under only realizability-type assumptions on the function approximators. While the existing theory has addressed learning under realizability and learning under non-exploratory data separately, no work has been able to address both simultaneously (except for a concurrent work, which we compare to in detail). Under an additional gap assumption, we provide guarantees for a simple pessimistic algorithm based on a version space formed by marginalized importance sampling (MIS), and the guarantee requires only that the data cover the optimal policy and that the function classes realize the optimal value and density-ratio functions. While similar gap assumptions have been used in other areas of RL theory, our work is the first to identify the utility and the novel mechanism of gap assumptions in offline RL with weak function approximation.
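As an illustrative sketch only (not the paper's exact construction): writing $\mathcal{Q}$ for a value-function class, $\mathcal{W}$ for a density-ratio (weight) class, $D$ for the offline dataset of $(s,a,r,s')$ tuples, $d_0$ for the initial-state distribution, $\gamma$ for the discount factor, and $\varepsilon$ for a tolerance (all notation introduced here for illustration), an MIS-based version-space algorithm with pessimistic selection can take the form
\[
  \mathcal{Q}_\varepsilon = \Big\{ q \in \mathcal{Q} \,:\, \max_{w \in \mathcal{W}} \Big| \widehat{\mathbb{E}}_{D}\big[\, w(s,a)\,\big(r + \gamma \max_{a'} q(s',a') - q(s,a)\big) \,\big] \Big| \le \varepsilon \Big\},
  \qquad
  \hat{q} \in \arg\min_{q \in \mathcal{Q}_\varepsilon} \ \mathbb{E}_{s_0 \sim d_0}\big[\, \max_{a} q(s_0,a) \,\big],
\]
with the learned policy taken to be greedy with respect to $\hat{q}$. Here the weight class supplies importance-sampling tests of the Bellman error on the data distribution, and pessimism enters through minimizing the estimated initial value over the surviving version space.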