We consider a subclass of $n$-player stochastic games, in which players have their own internal state/action spaces while they are coupled through their payoff functions. It is assumed that players' internal chains are driven by independent transition probabilities. Moreover, players can receive only realizations of their payoffs, not the actual functions, and cannot observe each other's states/actions. For this class of games, we first show that finding a stationary Nash equilibrium (NE) policy without any assumption on the reward functions is intractable. However, for general reward functions, we develop polynomial-time learning algorithms based on dual averaging and dual mirror descent, which converge in terms of the averaged Nikaido-Isoda distance to the set of $\epsilon$-NE policies almost surely or in expectation. In particular, under extra assumptions on the reward functions such as social concavity, we derive polynomial upper bounds on the number of iterates to achieve an $\epsilon$-NE policy with high probability. Finally, we evaluate the effectiveness of the proposed algorithms in learning $\epsilon$-NE policies using numerical experiments for energy management in smart grids.
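As a point of reference for the convergence metric, the following is a minimal sketch of the classical Nikaido-Isoda function on which the averaged distance above is based; the payoff functions $u_i$ and policy profile $\pi=(\pi_1,\dots,\pi_n)$ are generic placeholders rather than notation fixed by this abstract, and the paper's averaged variant for the stochastic setting may differ in details:
\[
\Psi(\pi,\hat{\pi}) \;=\; \sum_{i=1}^{n}\big[u_i(\hat{\pi}_i,\pi_{-i})-u_i(\pi_i,\pi_{-i})\big],
\qquad
\mathrm{NI}(\pi) \;=\; \max_{\hat{\pi}}\,\Psi(\pi,\hat{\pi}).
\]
Since each summand in $\mathrm{NI}(\pi)$ is maximized independently and is nonnegative, $\mathrm{NI}(\pi)\le\epsilon$ implies that no player can improve its payoff by more than $\epsilon$ through a unilateral deviation, i.e., $\pi$ is an $\epsilon$-NE.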