We study infinite-horizon discounted two-player zero-sum Markov games, and develop a decentralized algorithm that provably converges to the set of Nash equilibria under self-play. Our algorithm is based on running an Optimistic Gradient Descent Ascent (OGDA) algorithm on each state to learn the policies, with a critic that slowly learns the value of each state. To the best of our knowledge, this is the first algorithm in this setting that is simultaneously rational (converging to the opponent's best response when it uses a stationary policy), convergent (converging to the set of Nash equilibria under self-play), agnostic (no need to know the actions played by the opponent), and symmetric (players taking symmetric roles in the algorithm), while also enjoying a finite-time last-iterate convergence guarantee, all of which are desirable properties of decentralized algorithms.
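To make the high-level structure concrete, the following is a minimal, schematic sketch of the idea in the abstract: each player runs OGDA on every state's policy while a critic slowly tracks each state's value. All names (`ogda_selfplay`, `project_simplex`, the step sizes `eta` and `beta`) and the centralized, full-information simulation are illustrative assumptions; the actual algorithm is decentralized and agnostic to the opponent's actions, which this sketch does not attempt to reproduce faithfully.

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1.0) / np.arange(1, len(x) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(x - theta, 0.0)

def ogda_selfplay(R, P, gamma=0.9, eta=0.05, beta=0.01, iters=5000):
    """Illustrative self-play loop (hypothetical, simplified).

    R[s, a, b]: payoff to player 1 (the maximizer) in state s.
    P[s, a, b, t]: probability of transitioning from s to t under actions (a, b).
    """
    S, A, B = R.shape
    x = np.full((S, A), 1.0 / A)   # player 1's policy, one distribution per state
    y = np.full((S, B), 1.0 / B)   # player 2's policy, one distribution per state
    gx_prev = np.zeros((S, A))     # previous gradients, used by the optimistic step
    gy_prev = np.zeros((S, B))
    V = np.zeros(S)                # slow critic: value estimate for each state

    for _ in range(iters):
        # Per-state "Q matrix": immediate payoff plus discounted critic value.
        Q = R + gamma * np.einsum('sabt,t->sab', P, V)
        gx = np.einsum('sab,sb->sa', Q, y)   # gradient for player 1 (ascent)
        gy = np.einsum('sab,sa->sb', Q, x)   # gradient for player 2 (descent)

        # Optimistic gradient step on every state's policy.
        for s in range(S):
            x[s] = project_simplex(x[s] + eta * (2 * gx[s] - gx_prev[s]))
            y[s] = project_simplex(y[s] - eta * (2 * gy[s] - gy_prev[s]))
        gx_prev, gy_prev = gx, gy

        # Critic update: move V slowly (beta << eta) toward the value of the
        # current policy pair, so policies evolve on a faster timescale.
        V = (1 - beta) * V + beta * np.einsum('sa,sab,sb->s', x, Q, y)

    return x, y, V
```

The separation of step sizes (a larger `eta` for the policy updates and a much smaller `beta` for the critic) reflects the "slowly learning critic" mentioned above; the specific values here are placeholders, not the paper's constants.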