We consider a challenging theoretical problem in offline reinforcement learning (RL): obtaining sample-efficiency guarantees with a dataset lacking sufficient coverage, under only realizability-type assumptions for the function approximators. While the existing theory has addressed learning under realizability and under non-exploratory data separately, no work has been able to address both simultaneously (except for a concurrent work, to which we compare in detail). Under an additional gap assumption, we provide guarantees for a simple pessimistic algorithm based on a version space formed by marginalized importance sampling, and the guarantees only require the data to cover the optimal policy and the function classes to realize the optimal value and density-ratio functions. While similar gap assumptions have been used in other areas of RL theory, our work is the first to identify the utility and the novel mechanism of gap assumptions in offline RL with weak function approximation.
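For intuition only, the following is a minimal sketch of a pessimistic version-space procedure built on marginalized-importance-sampling-style average Bellman-error tests, not the paper's exact algorithm or analysis. It assumes small finite candidate classes of value functions and density-ratio (weight) functions, a tabular dataset of (s, a, r, s') tuples, and a known initial state; all names and thresholds are illustrative.

```python
import numpy as np

# Illustrative sketch (not the paper's exact procedure):
# pessimism over a version space defined by MIS-style average Bellman-error tests.

def avg_bellman_error(q, w, data, actions, gamma):
    """|E_D[ w(s,a) * (r + gamma * max_a' q(s',a') - q(s,a)) ]| over the dataset."""
    errs = [
        w(s, a) * (r + gamma * max(q(s2, a2) for a2 in actions) - q(s, a))
        for (s, a, r, s2) in data
    ]
    return abs(float(np.mean(errs)))

def pessimistic_mis_policy(q_class, w_class, data, actions, s0, gamma=0.99, eps=0.1):
    # Version space: candidate value functions that pass every density-ratio test.
    version_space = [
        q for q in q_class
        if all(avg_bellman_error(q, w, data, actions, gamma) <= eps for w in w_class)
    ]
    # Pessimism: among the survivors, keep the one with the smallest initial value.
    q_hat = min(version_space, key=lambda q: max(q(s0, a) for a in actions))
    # Act greedily with respect to the pessimistic choice.
    return lambda s: max(actions, key=lambda a: q_hat(s, a))
```

Under the abstract's assumptions, the optimal value function survives the tests (realizability plus coverage of the optimal policy), and the gap assumption is what lets the pessimistic selection avoid being misled by candidates that only look good on uncovered states.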