Offline reinforcement learning (RL) aims to learn an optimal policy from a pre-collected dataset without further interaction with the environment. While various algorithms have been proposed for offline RL in the previous literature, minimax optimality has only been (nearly) established for tabular Markov decision processes (MDPs). In this paper, we focus on offline RL with linear function approximation and propose a new pessimism-based algorithm for offline linear MDPs. At the core of our algorithm is an uncertainty decomposition via a reference function, which is new to the literature on offline RL with linear function approximation. Theoretical analysis demonstrates that our algorithm matches the performance lower bound up to logarithmic factors. We also extend our techniques to two-player zero-sum Markov games (MGs) and establish a new performance lower bound for MGs, which tightens the existing result and verifies the nearly minimax optimality of the proposed algorithm. To the best of our knowledge, these are the first computationally efficient and nearly minimax optimal algorithms for offline single-agent MDPs and MGs with linear function approximation.
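The abstract does not spell out the algorithm itself. As a point of reference only, the sketch below shows standard pessimistic least-squares value iteration for a finite-horizon linear MDP (in the spirit of prior pessimism-based methods such as PEVI), not the paper's reference-function-based uncertainty decomposition. All names here (`pessimistic_lsvi`, `feature`, `actions`, `beta`, `lam`) are illustrative assumptions rather than notation from the paper.

```python
# Minimal sketch of pessimistic least-squares value iteration for a linear MDP.
# This is a generic PEVI-style baseline, NOT the paper's proposed algorithm.
import numpy as np

def pessimistic_lsvi(data, feature, actions, H, d, beta=1.0, lam=1.0):
    """data: list over steps h of (states, acts, rewards, next_states);
    feature(s, a) -> np.ndarray of shape (d,); actions: finite action list."""
    V_next = lambda s: 0.0                        # terminal value V_{H+1} = 0
    policy = [None] * H
    for h in reversed(range(H)):                  # backward induction over the horizon
        S, A, R, S_next = data[h]
        K = len(S)
        Phi = np.stack([feature(S[k], A[k]) for k in range(K)])   # (K, d) design matrix
        Lam = Phi.T @ Phi + lam * np.eye(d)                       # regularized Gram matrix
        Lam_inv = np.linalg.inv(Lam)
        targets = np.array([R[k] + V_next(S_next[k]) for k in range(K)])
        w = Lam_inv @ Phi.T @ targets                             # ridge-regression weights

        def Q(s, a, w=w, Lam_inv=Lam_inv, h=h):
            phi = feature(s, a)
            bonus = beta * np.sqrt(phi @ Lam_inv @ phi)           # elliptical uncertainty
            return float(np.clip(phi @ w - bonus, 0.0, H - h))    # pessimism: subtract bonus

        policy[h] = lambda s, Q=Q: max(actions, key=lambda a: Q(s, a))
        V_next = lambda s, Q=Q: max(Q(s, a) for a in actions)
    return policy
```

The subtracted bonus penalizes state-action directions that are poorly covered by the dataset; the paper's contribution, per the abstract, is a sharper handling of this uncertainty via a reference function that removes the extra horizon/dimension factors and attains the lower bound up to logarithmic terms.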