We prove that optimistic follow-the-regularized-leader (OFTRL), together with smooth value updates, finds an $O(T^{-1})$-approximate Nash equilibrium in $T$ iterations for two-player zero-sum Markov games with full information. This improves upon the $\tilde{O}(T^{-5/6})$ convergence rate recently shown by Zhang et al. (2022). The refined analysis hinges on two essential ingredients. First, the sum of the regrets of the two players, though not necessarily non-negative as it is in normal-form games, is approximately non-negative in Markov games. This property allows us to bound the second-order path lengths of the learning dynamics. Second, we prove a tighter algebraic inequality regarding the weights deployed by OFTRL that shaves off an extra $\log T$ factor. This crucial improvement enables the inductive analysis that leads to the final $O(T^{-1})$ rate.
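For concreteness, the display below sketches the generic shape of a weighted OFTRL update for the max-player at a fixed state $s$, paired with a smooth (incrementally averaged) value update. This is an illustrative sketch only: the learning rate $\eta$, regularizer $\psi$, averaging weights $\alpha_t$, and per-state payoff matrices $Q^{\tau}(s)$ are placeholder symbols, not the exact quantities used in our analysis.
\begin{align*}
x^{t+1}(s) &\in \operatorname*{arg\,max}_{x \in \Delta(\mathcal{A})}
\Big\{ \eta \Big\langle x,\; \textstyle\sum_{\tau=1}^{t} g^{\tau}(s) + g^{t}(s) \Big\rangle + \psi(x) \Big\},
\qquad g^{\tau}(s) := Q^{\tau}(s)\, y^{\tau}(s), \\
V^{t+1}(s) &= (1-\alpha_{t})\, V^{t}(s) + \alpha_{t}\, \big( x^{t}(s) \big)^{\!\top} Q^{t}(s)\, y^{t}(s),
\end{align*}
where $y^{\tau}(s)$ is the min-player's strategy, $\psi$ is, e.g., the entropy regularizer, and $Q^{t}(s)$ may be thought of as an entrywise Bellman backup of the current value estimate $V^{t}$. The optimistic term counts the most recent payoff vector $g^{t}(s)$ twice; it is this prediction step that makes the regret controllable by the second-order path lengths of the dynamics mentioned above.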