We propose the first model-free algorithm that achieves low-regret performance for decentralized learning in two-player zero-sum tabular stochastic games with the infinite-horizon average-reward objective. In decentralized learning, the learning agent controls only one player and tries to achieve low regret against an arbitrary opponent. This contrasts with centralized learning, where the agent tries to approximate the Nash equilibrium by controlling both players. In our infinite-horizon undiscounted setting, additional structural assumptions are needed to guarantee good behavior of the learning process: here we assume that, for every strategy of the opponent, the agent can reach any state from any other state. This assumption is the analogue of the "communicating" assumption in the MDP setting. We show that our Decentralized Optimistic Nash Q-Learning (DONQ-learning) algorithm achieves both a sublinear high-probability regret of order $T^{3/4}$ and a sublinear expected regret of order $T^{2/3}$. Moreover, our algorithm enjoys low computational complexity and low memory requirements compared to the previous works of (Wei et al. 2017) and (Jafarnia-Jahromi et al. 2021) in the same setting.
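For concreteness, a common way to measure regret in this average-reward setting (as in Wei et al. 2017) compares the cumulative reward collected by the controlled player against the minimax value of the game; the paper's precise definition may differ in details, but a standard form is
$$ \mathrm{Reg}_T \;=\; T\,J^{*} \;-\; \sum_{t=1}^{T} r_t, $$
where $r_t$ is the reward received at step $t$ and $J^{*}$ is the average-reward (minimax) value of the zero-sum game.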