We study the learning of Nash equilibria in a general-sum stochastic game with an unknown transition probability density function. Agents take actions at the current environment state, and their joint action influences the transition of the environment state as well as their immediate rewards. Each agent observes only the environment state and its own immediate reward, and knows nothing about the actions or immediate rewards of the other agents. We introduce the concepts of weighted asymptotic Nash equilibrium with probability 1 and in probability. For the case with exact pseudo-gradients, we design a two-loop algorithm based on the equivalence between Nash equilibria and solutions of variational inequality problems. In the outer loop, we sequentially update a constructed strongly monotone variational inequality by adjusting a proximal parameter, while in the inner loop we employ a single-call extra-gradient algorithm to solve the constructed variational inequality. We show that, if the associated Minty variational inequality has a solution, the designed algorithm converges to the k^{1/2}-weighted asymptotic Nash equilibrium. Further, for the case with unknown pseudo-gradients, we propose a decentralized algorithm in which the G(PO)MDP gradient estimator of the pseudo-gradient is obtained by Monte-Carlo simulation. Convergence to the k^{1/4}-weighted asymptotic Nash equilibrium in probability is then established.
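To make the two-loop structure concrete, the following is a minimal Python sketch under stated assumptions: the operator F (the exact pseudo-gradient), the proximal weight schedule, the step size, and the iteration counts are all illustrative choices, not the paper's specification. The outer loop builds a strongly monotone surrogate by adding a proximal term centered at the current anchor; the inner loop runs a single-call extra-gradient (Popov / optimistic-gradient) update that reuses the previous operator evaluation instead of calling the operator twice per step.

```python
import numpy as np

def two_loop_proximal_extragradient(F, x0, outer_iters=50, inner_iters=100):
    """Hedged sketch of a two-loop proximal / single-call extra-gradient scheme.

    F  : callable returning the exact pseudo-gradient at a joint strategy x
         (an assumption of this sketch; the paper's operator is not specified here).
    x0 : initial joint strategy profile, as a flat numpy array.
    """
    x_anchor = x0.copy()
    for k in range(1, outer_iters + 1):
        mu = 1.0 / np.sqrt(k)  # illustrative proximal weight schedule
        # Strongly monotone surrogate VI operator centered at the anchor.
        G = lambda x, mu=mu, xa=x_anchor: F(x) + mu * (x - xa)

        x = x_anchor.copy()
        g_prev = G(x)            # single-call: store the last operator evaluation
        step = 0.5 / (1.0 + mu)  # illustrative step size
        for _ in range(inner_iters):
            x_half = x - step * g_prev  # extrapolate using the stored evaluation
            g_prev = G(x_half)          # the one operator call of this inner step
            x = x - step * g_prev       # update from the extrapolated evaluation
        x_anchor = x  # warm-start the next outer iteration
    return x_anchor
```

The single-call variant matters in this setting because each operator evaluation is an evaluation of the pseudo-gradient, which is the expensive (or, later, sampled) quantity.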
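The decentralized case can be illustrated similarly. Below is a hypothetical sketch of how a single agent might form a G(PO)MDP-style Monte-Carlo estimate of its component of the pseudo-gradient from its own observed rewards; the helper `sample_trajectory`, the discount factor, and the sample count are assumptions introduced for illustration only.

```python
import numpy as np

def gpomdp_estimate(grad_log_pi, rewards, gamma=0.99):
    """G(PO)MDP estimate of a policy gradient from one sampled trajectory.

    grad_log_pi : list of score vectors, grad_log_pi[t] = grad_theta log pi(a_t | s_t)
    rewards     : list of the agent's own immediate rewards r_t
    gamma       : discount factor (illustrative value)

    G(PO)MDP credits the reward at time t only with the scores of actions taken
    at or before t, which lowers variance relative to plain REINFORCE.
    """
    grad = np.zeros_like(grad_log_pi[0])
    score_sum = np.zeros_like(grad_log_pi[0])  # running sum of past scores
    for t, (g, r) in enumerate(zip(grad_log_pi, rewards)):
        score_sum += g                          # sum_{s<=t} grad log pi(a_s | s_s)
        grad += (gamma ** t) * r * score_sum
    return grad

def monte_carlo_pseudo_gradient(sample_trajectory, num_samples=100):
    """Average single-trajectory estimates over Monte-Carlo rollouts;
    sample_trajectory is assumed to return a (grad_log_pi, rewards) pair."""
    estimates = [gpomdp_estimate(*sample_trajectory()) for _ in range(num_samples)]
    return np.mean(estimates, axis=0)
```

Because each agent needs only its own observed state-action scores and rewards, this estimator is compatible with the decentralized information structure described above, in which agents never observe others' actions or rewards.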