Algorithms designed for single-agent reinforcement learning (RL) generally fail to converge to equilibria in two-player zero-sum (2p0s) games. Conversely, game-theoretic algorithms for approximating Nash and quantal response equilibria (QREs) in 2p0s games are not typically competitive for RL and can be difficult to scale. As a result, algorithms for these two cases are generally developed and evaluated separately. In this work, we show that a single algorithm -- a simple extension to mirror descent with proximal regularization that we call magnetic mirror descent (MMD) -- can produce strong results in both settings, despite their fundamental differences. From a theoretical standpoint, we prove that MMD converges linearly to QREs in extensive-form games -- this is the first time linear convergence has been proven for a first-order solver. Moreover, applied as a tabular Nash equilibrium solver via self-play, we show empirically that MMD produces results competitive with CFR in both normal-form and extensive-form games with full feedback (this is the first time that a standard RL algorithm has done so) and also that MMD empirically converges in black-box feedback settings. Furthermore, for single-agent deep RL, on a small collection of Atari and MuJoCo games, we show that MMD can produce results competitive with those of PPO. Lastly, for multi-agent deep RL, we show MMD can outperform NFSP in 3x3 Abrupt Dark Hex.
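To make the algorithm concrete, below is a minimal sketch of a tabular MMD update on the probability simplex, assuming a negative-entropy mirror map, a KL proximal term toward the current policy, and a KL "magnet" term toward a fixed reference policy (taken here to be uniform). The function name `mmd_step`, the parameters `eta` (step size) and `alpha` (magnet strength), and the matching-pennies self-play loop are illustrative choices, not code from the paper; the closed form is derived from the stated proximal objective.

```python
import numpy as np

def mmd_step(policy, grad, magnet, eta=0.1, alpha=0.05):
    """One tabular magnetic mirror descent step on the probability simplex.

    Sketch of the minimizer of
        eta*<grad, x> + eta*alpha*KL(x, magnet) + KL(x, policy)
    over the simplex under a negative-entropy mirror map.
    eta (step size) and alpha (magnet strength) are illustrative values.
    """
    logits = (np.log(policy) - eta * grad + eta * alpha * np.log(magnet)) / (1.0 + eta * alpha)
    unnormalized = np.exp(logits - logits.max())  # subtract max for numerical stability
    return unnormalized / unnormalized.sum()

# Illustrative self-play on matching pennies, a 2p0s matrix game (not from the paper).
payoff = np.array([[1.0, -1.0], [-1.0, 1.0]])  # row player's payoff matrix
x = np.ones(2) / 2                             # row player's policy
y = np.ones(2) / 2                             # column player's policy
magnet = np.ones(2) / 2                        # uniform magnet policy
for _ in range(500):
    gx = -payoff @ y     # row player minimizes its negated payoff
    gy = payoff.T @ x    # column player minimizes the row player's payoff
    x, y = mmd_step(x, gx, magnet), mmd_step(y, gy, magnet)
print(x, y)  # both policies should approach the uniform equilibrium (0.5, 0.5)
```

With a uniform magnet, the magnet term acts as entropy regularization, so the iterates track a QRE of the regularized game; this is only meant to illustrate the self-play use of the update described in the abstract.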