This paper is concerned with two-player zero-sum Markov games -- arguably the most basic setting in multi-agent reinforcement learning -- with the goal of learning a Nash equilibrium (NE) sample-optimally. All prior results suffer from at least one of two obstacles: the curse of multiple agents or the barrier of long horizon, regardless of the sampling protocol in use. We take a step towards settling this problem, assuming access to a flexible sampling mechanism: the generative model. Focusing on non-stationary finite-horizon Markov games, we develop a learning algorithm $\mathsf{Nash}\text{-}\mathsf{Q}\text{-}\mathsf{FTRL}$ and an adaptive sampling scheme that leverage the optimism principle in adversarial learning (particularly the Follow-the-Regularized-Leader (FTRL) method), with a delicate design of bonus terms that ensure certain decomposability under the FTRL dynamics. Our algorithm learns an $\varepsilon$-approximate Markov NE policy using $$ \widetilde{O}\bigg( \frac{H^4 S(A+B)}{\varepsilon^2} \bigg) $$ samples, where $S$ is the number of states, $H$ is the horizon, and $A$ (resp.~$B$) denotes the number of actions for the max-player (resp.~min-player). This is nearly unimprovable in a minimax sense. Along the way, we derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities, which might be of independent interest.
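For intuition, a minimal sketch of the kind of update FTRL performs is shown below, assuming an entropy regularizer (i.e., the exponential-weights form); the symbols $\pi_h^{(t)}$, $\widehat{q}_h^{(t)}$, and $\eta_{t}$ are illustrative only, and the precise $\mathsf{Nash}\text{-}\mathsf{Q}\text{-}\mathsf{FTRL}$ update, value estimates, and learning rates in the paper differ in their details: $$ \pi_h^{(t+1)}(a \mid s) \;\propto\; \exp\bigg( \eta_{t+1} \sum_{k=1}^{t} \widehat{q}_h^{(k)}(s, a) \bigg), \qquad a \in \{1, \dots, A\}, $$ with the min-player running the analogous update over its $B$ actions on the negated payoffs.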