We study offline multi-agent reinforcement learning (RL) in Markov games, where the goal is to learn an approximate equilibrium -- such as a Nash equilibrium or a (coarse) correlated equilibrium -- from an offline dataset pre-collected from the game. Existing works consider relatively restricted tabular or linear models and handle each equilibrium notion separately. In this work, we provide the first framework for sample-efficient offline learning in Markov games under general function approximation, handling all three equilibrium notions in a unified manner. Using Bellman-consistent pessimism, we obtain interval estimates of policies' returns, and use both the upper and lower bounds to obtain a relaxation of the gap of a candidate policy, which becomes our optimization objective. Our results generalize prior works and provide several additional insights. Importantly, we require a data coverage condition that improves over the recently proposed "unilateral concentrability". Our condition allows selective coverage of deviation policies that optimally trade off between their greediness (as approximate best responses) and coverage, and we show scenarios where this leads to significantly better guarantees. As a new connection, we also show how our algorithmic framework can subsume seemingly different solution concepts designed for the special case of two-player zero-sum games.
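To make the gap relaxation concrete, here is a minimal sketch in our own notation (an illustration, not the paper's exact objective): for a candidate joint policy $\pi$ and player $i$, suppose Bellman-consistent pessimism yields a lower bound $\underline{V}_i(\pi)$ on player $i$'s return and, for each covered unilateral deviation $\pi_i'$ in a class $\Pi_i'$, an upper bound $\overline{V}_i(\pi_i', \pi_{-i})$. Whenever these bounds are valid, player $i$'s gap admits the relaxation
\[
\mathrm{Gap}_i(\pi) \;=\; \max_{\pi_i'} V_i(\pi_i', \pi_{-i}) - V_i(\pi)
\;\le\; \max_{\pi_i' \in \Pi_i'} \overline{V}_i(\pi_i', \pi_{-i}) \;-\; \underline{V}_i(\pi),
\]
and minimizing these relaxed gaps over candidate policies gives an optimization objective of the kind described above; the choice of the deviation class $\Pi_i'$ is where the trade-off between greediness and coverage enters.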