We consider a subclass of $n$-player stochastic games in which each player has its own internal state/action space, and players are coupled only through their payoff functions. It is assumed that the players' internal chains are driven by independent transition probabilities. Moreover, players observe only realizations of their payoffs, not the payoff functions themselves, and cannot observe each other's states or actions. Under some assumptions on the structure of the payoff functions, we develop efficient learning algorithms based on dual averaging and dual mirror descent, which provably converge, almost surely or in expectation, to the set of $\epsilon$-Nash equilibrium policies. In particular, we derive upper bounds on the number of iterations required to reach an $\epsilon$-Nash equilibrium policy, and these bounds scale polynomially in the game parameters. Alongside Markov potential games and linear-quadratic stochastic games, this work thus provides another subclass of $n$-player stochastic games that provably admits polynomial-time learning algorithms for finding $\epsilon$-Nash equilibrium policies.
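As a point of reference, a minimal sketch of a generic dual-averaging policy update for a single player $i$ is given below. The aggregated score $y_i^t$, the gradient estimate $\hat g_i^t$, and the step size $\gamma_t$ are illustrative symbols not taken from the paper, and the paper's actual algorithms may use a different regularizer, step-size schedule, and payoff-estimation scheme.
\[
  y_i^{t+1} = y_i^{t} + \gamma_t\,\hat g_i^{t},
  \qquad
  \pi_i^{t+1}(a \mid s) \;=\; \frac{\exp\!\big(y_i^{t+1}(s,a)\big)}{\sum_{a'} \exp\!\big(y_i^{t+1}(s,a')\big)},
\]
where $\hat g_i^{t}$ is a stochastic estimate of player $i$'s payoff gradient constructed only from realized payoffs, consistent with the feedback model described above. With the entropic regularizer, the second step is the standard softmax (logit) map, so the iteration is dual averaging over the player's policy simplex.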