Policy space response oracles (PSRO) is a multi-agent reinforcement learning algorithm that has achieved state-of-the-art performance in very large two-player zero-sum games. PSRO is based on the tabular double oracle (DO) method, an algorithm that is guaranteed to converge to a Nash equilibrium but may increase exploitability from one iteration to the next. We propose anytime double oracle (ADO), a tabular double oracle algorithm for two-player zero-sum games that is guaranteed to converge to a Nash equilibrium while decreasing exploitability from one iteration to the next. Unlike DO, in which the restricted distribution is computed from the restricted game formed by each player's strategy set, ADO finds, for each player, the restricted distribution that minimizes its exploitability against any policy in the full, unrestricted game. We also propose a method for finding this restricted distribution via a no-regret algorithm updated against best responses, called RM-BR DO. Finally, we propose anytime PSRO (APSRO), a version of ADO that computes best responses via reinforcement learning. In experiments on Leduc poker and random normal-form games, we show that our methods achieve far lower exploitability than DO and PSRO and decrease exploitability monotonically.
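To make the double oracle loop referred to above concrete, the following is a minimal, self-contained sketch (not the paper's code): it runs tabular DO on a random two-player zero-sum normal-form game, approximates the restricted-game Nash equilibrium with regret matching, and prints exploitability at each iteration, which lets one observe that plain DO need not decrease exploitability monotonically. All names (A, restricted_nash, exploitability, and so on) are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

# Random two-player zero-sum normal-form game: row player receives A, column player -A.
rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, size=(30, 30))

def best_response_row(col_mix, col_actions):
    """Full-game row action maximizing payoff against the column player's restricted mixture."""
    return int(np.argmax(A[:, col_actions] @ col_mix))

def best_response_col(row_mix, row_actions):
    """Full-game column action minimizing the row player's payoff against the row mixture."""
    return int(np.argmin(row_mix @ A[row_actions, :]))

def restricted_nash(row_actions, col_actions, iters=2000):
    """Approximate Nash equilibrium of the restricted game via regret matching."""
    sub = A[np.ix_(row_actions, col_actions)]
    n, m = sub.shape
    reg_r, reg_c = np.zeros(n), np.zeros(m)
    avg_r, avg_c = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        p = np.maximum(reg_r, 0)
        p = p / p.sum() if p.sum() > 0 else np.full(n, 1 / n)
        q = np.maximum(reg_c, 0)
        q = q / q.sum() if q.sum() > 0 else np.full(m, 1 / m)
        u_r, u_c = sub @ q, -(p @ sub)        # per-action payoffs for each player
        reg_r += u_r - p @ u_r                # accumulate instantaneous regrets
        reg_c += u_c - q @ u_c
        avg_r += p
        avg_c += q
    return avg_r / avg_r.sum(), avg_c / avg_c.sum()

def exploitability(row_mix, row_actions, col_mix, col_actions):
    """Sum of each player's gain from deviating to a full-game best response (NashConv)."""
    br_vs_col = np.max(A[:, col_actions] @ col_mix)   # row player's best deviation value
    br_vs_row = np.min(row_mix @ A[row_actions, :])   # column player's best deviation (row payoff)
    return br_vs_col - br_vs_row

# Tabular double oracle: start from singleton populations and grow them with best responses.
row_actions, col_actions = [0], [0]
for it in range(15):
    row_mix, col_mix = restricted_nash(row_actions, col_actions)
    expl = exploitability(row_mix, row_actions, col_mix, col_actions)
    print(f"iter {it}: exploitability = {expl:.4f}")
    br_r = best_response_row(col_mix, col_actions)
    br_c = best_response_col(row_mix, row_actions)
    row_actions = sorted(set(row_actions) | {br_r})
    col_actions = sorted(set(col_actions) | {br_c})
```

In this sketch the restricted distribution is the equilibrium of the restricted game, as in standard DO; the abstract's ADO and RM-BR DO instead choose the restricted distribution to minimize exploitability against the full game, which is what yields the monotone decrease reported in the experiments.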