Offline Reinforcement Learning (RL) enables policy improvement from fixed datasets without online interaction, making it well suited to real-world applications that lack efficient simulators. Despite its success in the single-agent setting, offline multi-agent RL remains challenging, especially in competitive games. First, without access to the game environment, agents cannot interact with their opponents, precluding self-play, the dominant learning paradigm for competitive games. Second, real-world datasets cannot cover the entire state-action space of the game, which hinders the identification of Nash equilibria (NE). To address these issues, this paper introduces OFF-FSP, the first practical model-free offline RL algorithm for competitive games. We begin by simulating interactions with various opponents by reweighting the fixed dataset via importance sampling. This technique lets us learn best responses to different opponents and thereby instantiate an Offline Self-Play learning framework. To overcome the challenge of partial coverage, we combine single-agent offline RL methods with Fictitious Self-Play (FSP), approximating NE by constraining the approximate best responses away from out-of-distribution actions. Experiments on matrix games, extensive-form poker, and board games show that OFF-FSP achieves significantly lower exploitability than state-of-the-art baselines. Finally, we validate OFF-FSP on a real-world human-robot competitive task, demonstrating its potential for solving complex, hard-to-simulate real-world problems.
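The core reweighting idea can be illustrated with a minimal sketch. This is not the paper's implementation; the policies, dataset, and payoffs below are hypothetical. Logged opponent actions collected under a known behavior policy are reweighted by importance ratios so that weighted averages over the fixed dataset estimate outcomes against a different target opponent, without any new interaction:

```python
import numpy as np

def importance_weights(opp_actions, behavior_probs, target_probs):
    """Self-normalized per-sample weights w_i = pi_target(a_i) / pi_behavior(a_i).

    opp_actions: array of logged opponent action indices.
    behavior_probs / target_probs: action distributions of the data-collection
    opponent and the opponent we wish to simulate (illustrative names).
    """
    w = target_probs[opp_actions] / behavior_probs[opp_actions]
    return w / w.sum()  # normalize so weighted sums act as averages

# Toy single-step game: opponent actions logged under a uniform behavior policy.
rng = np.random.default_rng(0)
behavior = np.array([0.5, 0.5])              # opponent that generated the data
target = np.array([0.9, 0.1])                # opponent we want to play against
actions = rng.integers(0, 2, size=5000)      # logged opponent actions
rewards = np.where(actions == 0, 1.0, -1.0)  # our payoff vs each opponent action

w = importance_weights(actions, behavior, target)
# Weighted sum estimates our expected payoff against the *target* opponent
# (true value here: 0.9 * 1 + 0.1 * (-1) = 0.8).
est = float(np.sum(w * rewards))
```

In OFF-FSP this kind of reweighted dataset stands in for online play against each candidate opponent, so an offline RL learner can fit an (approximate) best response to it at every fictitious-play iteration.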