采用快速政策外快政策方法进行有内星正常化的竞争性运动会 (Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization)

This paper investigates the problem of computing the equilibrium of competitive games, which is often modeled as a constrained saddle-point optimization problem with probability simplex constraints. Despite recent efforts in understanding the last-iterate convergence of extragradient methods in the unconstrained setting, the theoretical underpinnings of these methods in the constrained settings, especially those using multiplicative updates, remain highly inadequate, even when the objective function is bilinear. Motivated by the algorithmic role of entropy regularization in single-agent reinforcement learning and game theory, we develop provably efficient extragradient methods to find the quantal response equilibrium (QRE) -- which are solutions to zero-sum two-player matrix games with entropy regularization -- at a linear rate. The proposed algorithms can be implemented in a decentralized manner, where each player executes symmetric and multiplicative updates iteratively using its own payoff without observing the opponent's actions directly. In addition, by controlling the knob of entropy regularization, the proposed algorithms can locate an approximate Nash equilibrium of the unregularized matrix game at a sublinear rate without assuming the Nash equilibrium to be unique. Our methods also lead to efficient policy extragradient algorithms for solving (entropy-regularized) zero-sum Markov games at similar rates. All of our convergence rates are nearly dimension-free, which are independent of the size of the state and action spaces up to logarithm factors, highlighting the positive role of entropy regularization for accelerating convergence.

翻译：本文调查了计算竞争性游戏平衡的问题。竞争性游戏通常被模拟成一个带有概率简单限制的有限马鞍优化问题。尽管最近努力理解了在不受限制的环境中,超升级方法在不限制的环境中的末端趋同,但这些方法的理论基础仍然非常不足, 特别是使用倍增更新的方法, 即使目标功能是双线的, 即使在目标函数是双线的。受单一试剂强化学习和游戏理论中的变相正规化算法作用的驱动, 我们开发了可被察觉到的高效超高级方法, 以找到四级反应平衡( QRE) -- -- 这是以直线速正规化的零和双级游戏基矩阵游戏的解决方案。提议的算法可以分散实施, 每个玩家在不直接观察对手动作的情况下, 反复使用对称和倍增倍的更新。此外, 拟议的算法可以将非正规化矩阵组合游戏的近端平流平衡率定位为次直线率, 并且我们所有的平级游戏的平流率也是唯一的。