In competitive two-agent environments, deep reinforcement learning (RL) methods based on the \emph{Double Oracle (DO)} algorithm, such as \emph{Policy Space Response Oracles (PSRO)} and \emph{Anytime PSRO (APSRO)}, iteratively add RL best-response policies to a population. Eventually, an optimal mixture of these population policies approximates a Nash equilibrium. However, these methods may need to add every deterministic policy before converging. In this work, we introduce \emph{Self-Play PSRO (SP-PSRO)}, a method that adds an approximately optimal stochastic policy to the population in each iteration. Instead of adding only deterministic best responses to the opponent's least-exploitable population mixture, SP-PSRO also learns an approximately optimal stochastic policy and adds it to the population. As a result, SP-PSRO empirically tends to converge much faster than APSRO, and in many games it converges in just a few iterations.
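To make the population-growth loop concrete, the following is a minimal, self-contained sketch on a toy zero-sum matrix game rather than the deep RL setting of the paper. All names here (`maximin`, `fictitious_play_average`, `sp_psro_sketch`) are our own, the restricted-game solve uses a standard linear program via `scipy.optimize.linprog`, and the fictitious-play average is only an assumed stand-in for the "approximately optimal stochastic policy" that SP-PSRO actually learns with RL; treat it as an illustration of the loop structure, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog


def maximin(A):
    """Row player's maximin mixed strategy and value for payoff matrix A."""
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                   # maximize v  <=>  minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])      # v - (A^T x)_j <= 0 for every column j
    A_eq = np.zeros((1, m + 1))
    A_eq[0, :m] = 1.0                              # x is a probability distribution
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=np.array([1.0]),
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[:m], res.x[-1]


def fictitious_play_average(A, steps=200):
    """Time-averaged strategies from a short self-play (fictitious play) run."""
    m, n = A.shape
    x_counts, y_counts = np.ones(m), np.ones(n)
    for _ in range(steps):
        x_counts[np.argmax(A @ (y_counts / y_counts.sum()))] += 1.0
        y_counts[np.argmin((x_counts / x_counts.sum()) @ A)] += 1.0
    return x_counts / x_counts.sum(), y_counts / y_counts.sum()


def sp_psro_sketch(A, iterations=5):
    """Grow row/column populations of mixed strategies, SP-PSRO style."""
    m, n = A.shape
    rows, cols = [np.eye(m)[0]], [np.eye(n)[0]]    # start from one arbitrary pure policy each
    for _ in range(iterations):
        R, C = np.array(rows), np.array(cols)
        meta = R @ A @ C.T                         # restricted (meta) game between populations
        sigma_r, _ = maximin(meta)                 # row meta-Nash mixture
        sigma_c, _ = maximin(-meta.T)              # column meta-Nash mixture
        x_mix, y_mix = sigma_r @ R, sigma_c @ C    # mixtures over the underlying pure actions
        # 1) Deterministic best responses to the opponent's mixture, as in PSRO/APSRO.
        rows.append(np.eye(m)[np.argmax(A @ y_mix)])
        cols.append(np.eye(n)[np.argmin(x_mix @ A)])
        # 2) The SP-PSRO addition: also add an approximately optimal *stochastic* policy
        #    for each player (stood in for here by a fictitious-play average).
        x_new, y_new = fictitious_play_average(A)
        rows.append(x_new)
        cols.append(y_new)
    R, C = np.array(rows), np.array(cols)
    sigma_r, _ = maximin(R @ A @ C.T)
    return sigma_r @ R                             # final row mixture over pure actions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.uniform(-1.0, 1.0, size=(6, 6))        # random zero-sum matrix game
    x = sp_psro_sketch(A)
    _, value = maximin(A)                          # exact game value, for reference
    print("worst-case payoff of final mixture:", round((x @ A).min(), 4))
    print("exact game value:                  ", round(value, 4))
```

In this sketch, step 1 alone corresponds to the PSRO/APSRO loop, while step 2 mirrors SP-PSRO's extra stochastic policy, so each iteration grows each player's population by two policies instead of one.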