Algorithms designed for single-agent reinforcement learning (RL) generally fail to converge to equilibria in two-player zero-sum (2p0s) games. On the other hand, game-theoretic algorithms for approximating Nash and regularized equilibria in 2p0s games are not typically competitive for RL and can be difficult to scale. As a result, algorithms for these two cases are generally developed and evaluated separately. In this work, we show that a single algorithm can produce strong results in both settings, despite their fundamental differences. This algorithm, which we call magnetic mirror descent (MMD), is a simple extension of mirror descent and a special case of a non-Euclidean proximal gradient algorithm. From a theoretical standpoint, we prove a novel linear convergence result for this non-Euclidean proximal gradient algorithm on a class of variational inequality problems. It follows from this result that MMD converges linearly to quantal response equilibria (i.e., entropy-regularized Nash equilibria) in extensive-form games; this is the first time linear convergence has been proven for a first-order solver. Moreover, applied as a tabular Nash equilibrium solver via self-play, we show empirically that MMD produces results competitive with CFR; this is the first time a standard RL algorithm has done so. Furthermore, for single-agent deep RL, on a small collection of Atari and MuJoCo tasks, we show that MMD can produce results competitive with those of PPO. Lastly, for multi-agent deep RL, we show MMD can outperform NFSP in 3x3 Abrupt Dark Hex.
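The following is a minimal sketch of the tabular MMD update applied via self-play in a matrix game, assuming the closed-form instantiation with KL proximal and magnet terms and a uniform magnet policy; the rock-paper-scissors payoff matrix, step size `eta`, and regularization temperature `alpha` are illustrative choices, not settings from the paper.

```python
# Sketch of tabular MMD self-play in a 2p0s matrix game (assumed closed-form
# KL instantiation): pi_{t+1} is proportional to
# [pi_t * exp(eta * q_t) * magnet^(eta * alpha)]^(1 / (1 + eta * alpha)).
import numpy as np

def mmd_step(pi, q, magnet, eta, alpha):
    """One MMD update: mirror descent on payoffs q with a pull toward the magnet."""
    logits = (np.log(pi) + eta * q + eta * alpha * np.log(magnet)) / (1.0 + eta * alpha)
    z = np.exp(logits - logits.max())  # softmax with max-subtraction for stability
    return z / z.sum()

# Rock-paper-scissors payoff for the row (maximizing) player.
A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])
x = np.ones(3) / 3          # row player's policy
y = np.ones(3) / 3          # column player's policy
magnet = np.ones(3) / 3     # uniform magnet policy
eta, alpha = 0.1, 0.05      # illustrative step size and QRE temperature

for _ in range(2000):       # self-play: both players run MMD simultaneously
    qx, qy = A @ y, -(A.T @ x)   # each player's payoff gradient
    x = mmd_step(x, qx, magnet, eta, alpha)
    y = mmd_step(y, qy, magnet, eta, alpha)

print(x, y)  # both policies approach the entropy-regularized equilibrium (uniform here)
```

With `alpha > 0` and a fixed magnet, the iterates converge to the corresponding quantal response equilibrium, which is the setting covered by the linear convergence result stated above.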