We present new independent learning dynamics provably converging to an efficient equilibrium (also known as an optimal equilibrium) maximizing the social welfare in infinite-horizon discounted identical-interest Markov games (MGs), in contrast to the recent concentration of progress on provable convergence to some (possibly inefficient) equilibrium. The dynamics are independent in the sense that agents take actions without considering the others' objectives in their decision-making process, and their decisions are consistent with their own objectives based on behavioral learning models. Independent and simultaneous adaptation of agents in an MG poses two key challenges: i) possible convergence to an inefficient equilibrium and ii) possible non-stationarity of the environment from a single agent's viewpoint. We address the former by generalizing the log-linear learning dynamics to the MG setting and the latter through a play-in-rounds scheme. Specifically, in an MG, agents play the (normal-form) stage game associated with the visited state, with payoffs determined by their continuation payoff estimates. We let the agents play these stage games in rounds such that their continuation payoff estimates are updated only at the end of each round. This makes the stage games stationary within each round. Hence, the dynamics approximate value iteration, and convergence to the social optimum of the underlying MG follows.
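To make the play-in-rounds scheme concrete, below is a minimal sketch under simplifying assumptions: a toy two-agent identical-interest MG with a common reward tensor and known transition kernel, and each stage game played independently per state (rather than along the Markov chain of visited states). All names here (log_linear_step, play_round, beta, round_length) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, beta = 3, 2, 0.9, 5.0

# Common (identical-interest) payoff r[s, a0, a1] and transition kernel
# P[s, a0, a1, s'] for a randomly generated toy MG.
r = rng.random((n_states, n_actions, n_actions))
P = rng.random((n_states, n_actions, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)  # make rows stochastic

def stage_payoff(s, v):
    # Stage game at state s: immediate reward plus discounted
    # continuation payoff estimate v, which stays frozen within a round.
    return r[s] + gamma * P[s] @ v

def log_linear_step(s, a, v):
    # Log-linear learning: one agent, chosen uniformly at random,
    # revises its action with softmax (logit) probabilities given the
    # other agent's current action.
    i = rng.integers(2)
    u = stage_payoff(s, v)
    util = u[:, a[1]] if i == 0 else u[a[0], :]
    p = np.exp(beta * util)
    p /= p.sum()
    a[i] = rng.choice(n_actions, p=p)
    return a

def play_round(v, round_length=500):
    # Within a round, agents play each stage game with v held fixed, so
    # every stage game is stationary; at the end of the round, v is
    # updated with a value-iteration-style backup at the joint action
    # the log-linear dynamics settled on.
    v_next = np.empty_like(v)
    for s in range(n_states):
        a = [rng.integers(n_actions), rng.integers(n_actions)]
        for _ in range(round_length):
            a = log_linear_step(s, a, v)
        v_next[s] = stage_payoff(s, v)[a[0], a[1]]
    return v_next

v = np.zeros(n_states)
for _ in range(50):  # successive rounds approximate value-iteration sweeps
    v = play_round(v)
print("continuation payoff estimates:", v)
```

As the temperature parameter beta grows, the log-linear dynamics concentrate on the joint actions maximizing the stage payoff (the efficient equilibrium of the identical-interest stage game), so the end-of-round backup approximates the Bellman optimality update, which is the mechanism by which the scheme approximates value iteration.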