We consider a subclass of $n$-player stochastic games, in which players have their own internal state/action spaces while being coupled through their payoff functions. It is assumed that players' internal chains are driven by independent transition probabilities. Moreover, players can only observe realizations of their payoffs, not the payoff functions themselves, nor can they observe each other's states/actions. Under some assumptions on the structure of the payoff functions, we develop efficient learning algorithms based on Dual Averaging and Dual Mirror Descent, which provably converge almost surely or in expectation to the set of $\epsilon$-Nash equilibrium policies. In particular, we derive upper bounds on the number of iterations required to reach an $\epsilon$-Nash equilibrium policy, which scale polynomially in the game parameters. Besides Markov potential games and linear-quadratic stochastic games, this work identifies another interesting subclass of $n$-player stochastic games that, under suitable assumptions, provably admits polynomial-time learning algorithms for finding $\epsilon$-Nash equilibrium policies.
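The abstract does not specify the exact form of the updates; as an illustration only, a generic dual-averaging scheme for player $i$'s policy $\pi_i$ over its action simplex $\Delta(A_i)$ would read (here $\hat{u}_i^t$, $h_i$, and $\eta_t$ are assumed notation, not taken from the paper):
\[
y_i^{t+1} = y_i^t + \hat{u}_i^t, \qquad
\pi_i^{t+1} = \operatorname*{arg\,max}_{\pi_i \in \Delta(A_i)} \left\{ \langle y_i^{t+1}, \pi_i \rangle - \frac{1}{\eta_t}\, h_i(\pi_i) \right\},
\]
where $\hat{u}_i^t$ would be an unbiased estimate of player $i$'s payoff gradient constructed from observed payoff realizations, $h_i$ a strongly convex regularizer (e.g., negative entropy), and $\eta_t$ a step size. With the entropic regularizer, the $\arg\max$ step reduces to the familiar exponential-weights map $\pi_i^{t+1} \propto \exp(\eta_t\, y_i^{t+1})$.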