The Policy Space Response Oracles (PSRO) method provides a general solution for finding Nash equilibria in two-player zero-sum games, but it suffers from two problems: (1) computational inefficiency, because the current population must be repeatedly evaluated by game simulations; and (2) exploration inefficiency, because the best response at each iteration is learned against a fixed meta-strategy. In this work, we propose Efficient PSRO (EPSRO), which largely improves the efficiency of both steps. Central to our development is a newly introduced subroutine: minimax optimization on unrestricted-restricted (URR) games. By solving a URR game at each step, one can evaluate the current game and compute the best response in a single forward pass, with no need for game simulations. Theoretically, we prove that the solution procedure of EPSRO yields a monotonic improvement in exploitability. Moreover, EPSRO is parallelizable, which allows for efficient exploration of the policy space and induces behavioral diversity. We test EPSRO on three classes of games and report a 50x speedup in wall-time, 10x data efficiency, and exploitability comparable to existing PSRO methods on Kuhn and Leduc Poker.
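To make the URR idea concrete, below is a minimal sketch on a toy zero-sum matrix game, where the URR minimax reduces to a linear program: the restricted player mixes over a fixed population of pure strategies (the meta-strategy), while the unrestricted player ranges over the full strategy set, so solving one URR game yields both the evaluation (game value and meta-strategy) and a best response in one pass. This is an illustration under simplifying assumptions, not the paper's algorithm: in EPSRO the unrestricted player is a reinforcement-learning policy and the minimax is solved by gradient-based training, and the helper name `solve_urr` is hypothetical.

```python
# Toy illustration of the unrestricted-restricted (URR) minimax on a matrix
# game. NOTE: this is a sketch of the idea only; EPSRO itself solves URR games
# with learned policies, not with an exact LP over pure strategies.
import numpy as np
from scipy.optimize import linprog

def solve_urr(A, population):
    """Solve max_pi min_sigma pi^T A sigma, where pi ranges over all rows
    (unrestricted player) and sigma is a mixture over `population` columns
    (restricted player). Returns the URR game value, the restricted player's
    meta-strategy sigma, and an unrestricted best-response row index."""
    B = A[:, population]                           # restricted payoff sub-matrix
    n_rows, n_cols = B.shape
    # LP: min v  s.t.  (B sigma)_i <= v for every row i, sigma in the simplex.
    # Decision variables: [sigma_1 .. sigma_k, v].
    c = np.zeros(n_cols + 1); c[-1] = 1.0
    A_ub = np.hstack([B, -np.ones((n_rows, 1))])   # B sigma - v <= 0
    b_ub = np.zeros(n_rows)
    A_eq = np.hstack([np.ones((1, n_cols)), np.zeros((1, 1))])
    b_eq = np.ones(1)                              # sum(sigma) = 1
    bounds = [(0, None)] * n_cols + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    sigma, value = res.x[:n_cols], res.x[-1]
    best_response = int(np.argmax(B @ sigma))      # unrestricted best response
    return value, sigma, best_response

# Rock-paper-scissors payoffs for the row (unrestricted) player.
A = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
population = [0]                                   # restricted player starts with "rock"
for _ in range(3):
    value, sigma, br = solve_urr(A, population)
    print(f"population={population}, value={value:.2f}, meta-strategy={sigma}")
    if br not in population:
        population.append(br)                      # grow the restricted population
```

In this toy setting the loop recovers a double-oracle-style expansion: each URR solve simultaneously evaluates the current population (no pairwise simulations of a meta-game payoff matrix) and produces the next best response, which is the efficiency gain the abstract describes.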