We study decentralized learning in two-player zero-sum discounted Markov games, where the goal is to design a policy optimization algorithm for either player that satisfies two properties. First, a player does not need to know the policy of the opponent to update its own policy. Second, when both players adopt the algorithm, their joint policy converges to a Nash equilibrium of the game. To this end, we construct a meta algorithm, dubbed $\texttt{Homotopy-PO}$, which provably finds a Nash equilibrium at a global linear rate. In particular, $\texttt{Homotopy-PO}$ interweaves two base algorithms, $\texttt{Local-Fast}$ and $\texttt{Global-Slow}$, via homotopy continuation. $\texttt{Local-Fast}$ is an algorithm that enjoys local linear convergence, while $\texttt{Global-Slow}$ is an algorithm that converges globally but at a slower sublinear rate. By switching between these two base algorithms, $\texttt{Global-Slow}$ essentially serves as a ``guide'' which identifies a benign neighborhood where $\texttt{Local-Fast}$ enjoys fast convergence. However, since the exact size of such a neighborhood is unknown, we apply a doubling trick to switch between these two base algorithms. The switching scheme is delicately designed so that the aggregated performance of the algorithm is driven by $\texttt{Local-Fast}$. Furthermore, we prove that $\texttt{Local-Fast}$ and $\texttt{Global-Slow}$ can both be instantiated by variants of the optimistic gradient descent/ascent (OGDA) method, which is of independent interest.
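The switching-with-doubling idea above can be illustrated with a minimal sketch. This is not the paper's algorithm; it only shows the generic pattern of alternating a globally convergent "guide" with a locally fast method while doubling the per-phase budget. The names `local_fast`, `global_slow`, and `gap` are hypothetical black-box callables standing in for the two base algorithms and a Nash-gap oracle.

```python
def homotopy_po(policy, local_fast, global_slow, gap, tol=1e-6, t0=1):
    """Hedged sketch of a Homotopy-PO-style switching scheme.

    `global_slow` makes slow but global progress toward a benign
    neighborhood; `local_fast` converges fast only inside it. Since the
    neighborhood's size is unknown, each phase's budget is doubled.
    """
    budget = t0
    while gap(policy) > tol:
        # Phase 1: Global-Slow acts as a "guide" (sublinear, global).
        for _ in range(budget):
            policy = global_slow(policy)
        # Phase 2: attempt Local-Fast; keep the result only if the
        # gap actually shrank (i.e., we were in the benign neighborhood).
        candidate = policy
        for _ in range(budget):
            candidate = local_fast(candidate)
        if gap(candidate) < gap(policy):
            policy = candidate
        budget *= 2  # doubling trick: neighborhood size is unknown
    return policy
```

On a toy one-dimensional problem (e.g. `local_fast = lambda x: 0.5 * x`, `global_slow = lambda x: 0.9 * x`, `gap = abs`), the aggregated progress is dominated by the faster contraction once `local_fast` starts being accepted, mirroring how the paper's scheme lets $\texttt{Local-Fast}$ drive the overall rate.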