We propose a multi-agent reinforcement learning dynamics and analyze its convergence in infinite-horizon discounted Markov potential games. We focus on the independent and decentralized setting, where players have no knowledge of the game model and cannot coordinate. In each stage, players asynchronously update their estimates of a perturbed Q-function, which evaluates their total contingent payoff, based on the realized one-stage reward. Then, players independently update their policies by incorporating a smoothed optimal one-stage deviation strategy based on the estimated Q-function. A key feature of the learning dynamics is that the Q-function estimates are updated at a faster timescale than the policies. We prove that the policies induced by our learning dynamics converge to a stationary Nash equilibrium in Markov potential games with probability 1. Our results highlight the efficacy of simple learning dynamics in reaching a stationary Nash equilibrium even in environments where only minimal information is available.
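To make the two-timescale structure concrete, the following is a minimal sketch of one player's updates in a tabular setting: the fast, asynchronous Q-function update from the realized one-stage reward, followed by a slow policy step toward a smoothed (softmax) optimal one-stage deviation. The class name `IndependentLearner`, the step-size schedules, and the temperature `tau` are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def smoothed_best_response(q_values, tau):
    """Softmax over one-stage Q-values: a smoothed optimal one-stage deviation."""
    z = q_values / tau
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

class IndependentLearner:
    """One player's view: no game model, no coordination with other players."""

    def __init__(self, n_states, n_actions, gamma, tau):
        self.Q = np.zeros((n_states, n_actions))              # perturbed Q-function estimate
        self.pi = np.full((n_states, n_actions), 1.0 / n_actions)
        self.gamma, self.tau = gamma, tau
        self.visits = np.zeros((n_states, n_actions))          # counts for asynchronous updates

    def update(self, s, a, r, s_next, t):
        # Fast timescale: asynchronous Q update at the visited (state, action) pair,
        # driven only by the realized one-stage reward (illustrative step size).
        self.visits[s, a] += 1
        alpha = 1.0 / self.visits[s, a] ** 0.6
        cont = self.pi[s_next] @ self.Q[s_next]                # continuation value under own policy
        self.Q[s, a] += alpha * (r + self.gamma * cont - self.Q[s, a])

        # Slow timescale: move the policy toward the smoothed one-stage deviation
        # (step size decays faster than alpha, so policies evolve more slowly).
        beta = 1.0 / (t + 1)
        br = smoothed_best_response(self.Q[s], self.tau)
        self.pi[s] += beta * (br - self.pi[s])
```

The separation of step sizes is what makes the dynamics two-timescale: from the policy's perspective, the Q-function estimates appear nearly converged, while from the Q-function's perspective, the policies appear nearly static.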