Finding approximate Nash equilibria in zero-sum imperfect-information games is challenging when the number of information states is large. Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in game theory that is guaranteed to converge to an approximate Nash equilibrium. However, PSRO requires training a reinforcement learning policy at each iteration, making it too slow for large games. We show through counterexamples and experiments that DCH and Rectified PSRO, two existing approaches to scaling up PSRO, fail to converge even in small games. We introduce Pipeline PSRO (P2SRO), the first scalable general method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able to parallelize PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning workers, each training against the policies generated by lower levels in the hierarchy. We show that unlike existing methods, P2SRO converges to an approximate Nash equilibrium, and does so faster as the number of parallel workers increases, across a variety of imperfect-information games. We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of $10^{50}$. P2SRO is able to achieve state-of-the-art performance on Barrage Stratego and beats all existing bots.
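To make the pipeline idea concrete, the following is a minimal schematic sketch in Python, not the authors' implementation. The names `Policy`, `train_step`, `has_plateaued`, and `compute_meta_strategy` are hypothetical placeholders standing in for an RL policy, its training loop, a convergence test, and an empirical-game meta-solver; only the control flow (active workers training against lower levels, the lowest worker being fixed into the population when it plateaus) is meant to reflect the method described above.

```python
import random

class Policy:
    """Hypothetical stand-in for a reinforcement learning policy."""
    def __init__(self, name):
        self.name = name
        self.skill = 0.0          # crude proxy for training progress
        self.fixed = False        # fixed policies are no longer trained

def train_step(policy, opponents, meta_strategy):
    """Placeholder training step: in P2SRO this would run RL against
    opponents sampled from the meta-strategy over lower-level policies."""
    policy.skill += random.uniform(0.0, 0.1) * (1.0 + 0.1 * len(opponents))

def has_plateaued(policy):
    """Placeholder convergence check for the lowest active policy."""
    return policy.skill >= 1.0

def compute_meta_strategy(fixed_policies):
    """Placeholder meta-solver: uniform mixture over fixed policies.
    PSRO-style methods would instead solve the empirical game,
    e.g. for a Nash equilibrium of the estimated payoff matrix."""
    n = len(fixed_policies)
    return [1.0 / n] * n if n else []

def pipeline_psro(num_workers=3, num_iters=200):
    fixed = [Policy("fixed_0")]   # initial population of fixed policies
    fixed[0].fixed = True
    active = [Policy(f"active_{i}") for i in range(num_workers)]

    for _ in range(num_iters):
        meta = compute_meta_strategy(fixed)
        # Each active worker trains against everything below it in the
        # hierarchy: the fixed population plus lower-level active policies.
        for level, policy in enumerate(active):
            opponents = fixed + active[:level]
            train_step(policy, opponents, meta)
        # When the lowest active policy plateaus, fix it into the
        # population and start a fresh worker at the top of the pipeline.
        if has_plateaued(active[0]):
            lowest = active.pop(0)
            lowest.fixed = True
            fixed.append(lowest)
            active.append(Policy(f"active_{len(fixed) + len(active)}"))
    return fixed

if __name__ == "__main__":
    population = pipeline_psro()
    print([p.name for p in population])
```

Under these assumptions, the sketch illustrates why the hierarchy preserves PSRO's convergence behavior while adding parallelism: the lowest active worker trains only against already-fixed policies, exactly as a single PSRO iteration would, while higher-level workers warm-start against the partially trained policies below them.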