We study the performance of the gradient play algorithm for stochastic games (SGs), where each agent tries to maximize its own total discounted reward by making decisions independently based on current state information that is shared among agents. Policies are directly parameterized by the probability of choosing a certain action at a given state. We show that Nash equilibria (NEs) and first-order stationary policies are equivalent in this setting, and give a local convergence rate around strict NEs. Further, for a subclass of SGs called Markov potential games (which includes the cooperative setting with identical rewards among agents as an important special case), we design a sample-based reinforcement learning algorithm and give a non-asymptotic global convergence rate analysis for both exact gradient play and our sample-based learning algorithm. Our result shows that the number of iterations to reach an $\epsilon$-NE scales linearly, instead of exponentially, with the number of agents. Local geometry and local stability are also considered: we prove that strict NEs are local maxima of the total potential function and that fully mixed NEs are saddle points.
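To make the setting concrete, the following is a minimal sketch of gradient play with direct policy parameterization on a toy cooperative (identical-reward) stochastic game: each agent independently runs projected gradient ascent on its own value function using exact gradients. The game instance, step size, and number of iterations are illustrative assumptions, not the paper's experiments.

```python
# Minimal sketch: gradient play with direct parameterization pi_i(a|s) on a
# toy 2-agent, 2-state, 2-action cooperative stochastic game (identical rewards).
# All numerical choices below are illustrative assumptions.
import numpy as np

GAMMA = 0.9                      # discount factor
N_STATES, N_ACTIONS = 2, 2       # states and actions per agent

# Shared reward r[s, a1, a2] and transition kernel P[s, a1, a2, s'] (toy instance).
rng = np.random.default_rng(0)
r = rng.uniform(size=(N_STATES, N_ACTIONS, N_ACTIONS))
P = rng.uniform(size=(N_STATES, N_ACTIONS, N_ACTIONS, N_STATES))
P /= P.sum(axis=-1, keepdims=True)
rho = np.ones(N_STATES) / N_STATES   # initial state distribution

def project_simplex(v):
    """Euclidean projection of each row of v onto the probability simplex."""
    u = np.sort(v, axis=-1)[:, ::-1]
    css = np.cumsum(u, axis=-1)
    k = np.arange(1, v.shape[-1] + 1)
    n_pos = (u - (css - 1.0) / k > 0).sum(axis=-1)
    tau = (css[np.arange(v.shape[0]), n_pos - 1] - 1.0) / n_pos
    return np.maximum(v - tau[:, None], 0.0)

def value_and_visitation(pi1, pi2):
    """State values V(s) and discounted state visitation d(s) under (pi1, pi2)."""
    pi = pi1[:, :, None] * pi2[:, None, :]            # joint policy pi[s, a1, a2]
    r_pi = (pi * r).sum(axis=(1, 2))                  # expected reward per state
    P_pi = np.einsum('sab,sabt->st', pi, P)           # induced state transition matrix
    V = np.linalg.solve(np.eye(N_STATES) - GAMMA * P_pi, r_pi)
    d = (1 - GAMMA) * np.linalg.solve(np.eye(N_STATES) - GAMMA * P_pi.T, rho)
    return V, d, pi

def exact_gradients(pi1, pi2):
    """Exact policy gradients w.r.t. the direct parameters pi_i(a|s)."""
    V, d, _ = value_and_visitation(pi1, pi2)
    Q = r + GAMMA * P @ V                             # joint Q[s, a1, a2]
    Q1 = np.einsum('sab,sb->sa', Q, pi2)              # agent 1: average over a2
    Q2 = np.einsum('sab,sa->sb', Q, pi1)              # agent 2: average over a1
    # Policy-gradient theorem with direct parameterization:
    # dV(rho)/dpi_i(a|s) = d(s) * Qbar_i(s, a) / (1 - gamma).
    return d[:, None] * Q1 / (1 - GAMMA), d[:, None] * Q2 / (1 - GAMMA)

# Gradient play: each agent independently performs projected gradient ascent.
pi1 = np.ones((N_STATES, N_ACTIONS)) / N_ACTIONS
pi2 = np.ones((N_STATES, N_ACTIONS)) / N_ACTIONS
eta = 0.1                                             # step size (assumed)
for _ in range(500):
    g1, g2 = exact_gradients(pi1, pi2)
    pi1 = project_simplex(pi1 + eta * g1)
    pi2 = project_simplex(pi2 + eta * g2)

V, _, _ = value_and_visitation(pi1, pi2)
print("value from initial distribution after gradient play:", rho @ V)
```

The sample-based variant analyzed in the paper replaces the exact gradients above with estimates built from trajectories; the projected-ascent update on the directly parameterized policies is otherwise the same.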