Offline reinforcement learning (offline RL) is an emerging field that has recently gained attention across various application domains due to its ability to learn policies from previously collected datasets. Using logged data is imperative when further interaction with the environment is expensive (computationally or otherwise), unsafe, or entirely infeasible. Offline RL has proved very successful, paving a path to solving previously intractable real-world problems, and we aim to generalize this paradigm to a multi-agent or multiplayer-game setting. Very little research has been done in this area so far, as progress is hindered by the lack of standardized datasets and meaningful benchmarks. In this work, we coin the term offline equilibrium finding (OEF) to describe this area and construct multiple datasets consisting of strategies collected across a wide range of games using several established methods. We also propose a benchmark method: an amalgamation of behavior cloning and a model-based algorithm. Our two model-based algorithms, OEF-PSRO and OEF-CFR, are adaptations of the widely used equilibrium-finding algorithms PSRO and Deep CFR to the offline-learning setting. In the empirical part, we evaluate the performance of the benchmark algorithms on the constructed datasets. We hope that our efforts will help accelerate research in large-scale equilibrium finding. Datasets and code are available at https://github.com/SecurityGames/oef.