Efficient Policy Space Response Oracles

Policy Space Response Oracles (PSRO) methods provide a general approach to learning Nash equilibria in two-player zero-sum games, but they suffer from two drawbacks: (1) computational inefficiency, due to the need for consistent meta-game evaluation via simulations, and (2) exploration inefficiency, due to finding the best response against a fixed meta-strategy at every epoch. In this work, we propose Efficient PSRO (EPSRO), which largely improves the efficiency of both steps. Central to our development is a newly introduced subroutine: no-regret optimization on the unrestricted-restricted (URR) game. By solving the URR game at each epoch, one can evaluate the current game and compute the best response in one forward pass, without the need for meta-game simulations. Theoretically, we prove that the solution procedure of EPSRO offers a monotonic improvement in exploitability, a property that no existing PSRO method possesses. Furthermore, we prove that the no-regret optimization has a regret bound of $\mathcal{O}(\sqrt{T\log{[(k^2+k)/2]}})$, where $k$ is the size of the restricted policy set. Most importantly, EPSRO is parallelizable, which allows for highly efficient exploration in the policy space that induces behavioral diversity. We test EPSRO on three classes of games and report a 50x speedup in wall-clock time and 10x better data efficiency, while maintaining exploitability similar to that of existing PSRO methods on Kuhn and Leduc Poker.
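
To make the two central quantities concrete (no-regret optimization and exploitability), the following is a minimal sketch in Python. It is not the EPSRO or URR procedure itself, which the abstract does not specify in detail. It assumes the classical Hedge (multiplicative weights) update, run in self-play on a small hypothetical 3x3 zero-sum matrix game chosen only for illustration. Hedge over $n$ actions has a regret bound of the form $\mathcal{O}(\sqrt{T\log n})$, which matches the shape of the bound quoted above; whether EPSRO uses Hedge specifically is an assumption here. The function names `hedge_self_play` and `exploitability` are illustrative.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax."""
    z = z - np.max(z)
    w = np.exp(z)
    return w / w.sum()


def exploitability(A, x, y):
    """Total best-response gain against the profile (x, y).

    A is the row player's payoff matrix; the row player maximizes
    x^T A y and the column player minimizes it. The result is zero
    exactly at a Nash equilibrium.
    """
    row_best = np.max(A @ y)   # best pure-strategy payoff for row vs. y
    col_best = np.min(x @ A)   # lowest payoff column can force vs. x
    return float(row_best - col_best)


def hedge_self_play(A, T=5000):
    """Run Hedge for both players and return their average strategies.

    Assumption: learning rate sqrt(log(n) / T), a standard choice that
    yields O(sqrt(T log n)) regret for bounded payoffs.
    """
    n, m = A.shape
    eta_row = np.sqrt(np.log(n) / T)
    eta_col = np.sqrt(np.log(m) / T)
    gain_row = np.zeros(n)     # cumulative payoff of each row action
    gain_col = np.zeros(m)     # cumulative payoff of each column action
    x_sum = np.zeros(n)
    y_sum = np.zeros(m)
    for _ in range(T):
        x = softmax(eta_row * gain_row)
        y = softmax(eta_col * gain_col)
        x_sum += x
        y_sum += y
        gain_row += A @ y      # row player's payoff per action
        gain_col += -(x @ A)   # column player's payoff is the negation
    return x_sum / T, y_sum / T


# Hypothetical biased rock-paper-scissors game (illustration only).
A = np.array([[0.0, -1.0, 2.0],
              [1.0, 0.0, -1.0],
              [-2.0, 1.0, 0.0]])

uniform = np.ones(3) / 3
x_avg, y_avg = hedge_self_play(A)

print("exploitability of uniform play:  ", exploitability(A, uniform, uniform))
print("exploitability of Hedge averages:", exploitability(A, x_avg, y_avg))
```

In a two-player zero-sum game, the exploitability of the time-averaged strategies is bounded by the sum of the two players' regrets divided by $T$, so a sublinear regret bound implies that the averages approach a Nash equilibrium. This is the general link between a no-regret subroutine and exploitability; how EPSRO applies it to the URR game is described in the paper rather than in this sketch.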
