Offline reinforcement learning (RL) aims to learn an optimal policy from a pre-collected dataset, without further interaction with the environment. While various algorithms have been proposed for offline RL in the literature, minimax optimal performance has only been (nearly) achieved for tabular Markov decision processes (MDPs). In this paper, we focus on offline RL with linear function approximation and propose two new algorithms, SPEVI+ and SPMVI+, for single-agent MDPs and two-player zero-sum Markov games (MGs), respectively. The proposed algorithms feature carefully crafted data-splitting mechanisms and novel variance-reduced pessimistic estimators. Our theoretical analysis shows that both algorithms match the performance lower bounds up to logarithmic factors. As a byproduct, we establish a new performance lower bound for MGs that tightens existing results. To the best of our knowledge, these are the first computationally efficient and nearly minimax optimal algorithms for offline single-agent MDPs and MGs with linear function approximation.
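For context, the following is a minimal sketch of the standard pessimistic value-iteration template for linear MDPs on which such algorithms build; the notation (feature map $\phi$, horizon $H$, regularization $\lambda$, bonus coefficient $\beta$, dataset of $K$ trajectories) is generic, and the data-splitting mechanism and variance-reduced regression that distinguish SPEVI+ are not reproduced here.
\begin{align*}
\Lambda_h &= \sum_{\tau=1}^{K} \phi(s_h^\tau, a_h^\tau)\,\phi(s_h^\tau, a_h^\tau)^\top + \lambda I, \\
\widehat{w}_h &= \Lambda_h^{-1} \sum_{\tau=1}^{K} \phi(s_h^\tau, a_h^\tau)\,\bigl(r_h^\tau + \widehat{V}_{h+1}(s_{h+1}^\tau)\bigr), \\
\widehat{Q}_h(s,a) &= \min\Bigl\{ \phi(s,a)^\top \widehat{w}_h - \beta \sqrt{\phi(s,a)^\top \Lambda_h^{-1} \phi(s,a)},\; H-h+1 \Bigr\}^{+}, \\
\widehat{V}_h(s) &= \max_{a} \widehat{Q}_h(s,a), \qquad \widehat{\pi}_h(s) \in \operatorname*{arg\,max}_{a} \widehat{Q}_h(s,a),
\end{align*}
computed backward for $h = H, \dots, 1$ with $\widehat{V}_{H+1} \equiv 0$. The elliptical bonus subtracted from the ridge-regression estimate enforces pessimism with respect to the uncertainty of the offline data.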