A natural solution concept for many multi-agent settings is the Stackelberg equilibrium, under which a ``leader'' agent selects a strategy that maximizes its own payoff assuming the ``follower'' chooses its best response to this strategy. Recent work has presented asymmetric learning updates that can be shown to converge to the \textit{differential} Stackelberg equilibria of two-player differentiable games. These updates are ``coupled'' in the sense that the leader requires some information about the follower's payoff function. Such coupled learning rules cannot be applied to \textit{ad hoc} interactive learning settings, and can be computationally impractical even in centralized training settings where the follower's payoffs are known. In this work, we present an ``uncoupled'' learning process under which each player's learning update depends only on its observations of the other player's behavior. We prove that this process converges to a local Stackelberg equilibrium under conditions similar to those required by previous coupled methods. We conclude with a discussion of the potential applications of our approach to human--AI cooperation and multi-agent reinforcement learning.
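As a concrete sketch of the coupled/uncoupled distinction (the notation below is illustrative and not fixed by the abstract): suppose the leader and follower choose parameters $x_1$ and $x_2$ and maximize differentiable payoffs $f_1(x_1, x_2)$ and $f_2(x_1, x_2)$, and let $r(x_1)$ denote the follower's local best response, defined implicitly by $\nabla_{x_2} f_2(x_1, r(x_1)) = 0$. A typical coupled leader update from prior work ascends the total gradient of $f_1(x_1, r(x_1))$, which by the implicit function theorem takes the form
\begin{equation*}
    \nabla_{x_1} f_1 \;-\; \nabla^2_{x_1 x_2} f_2 \left(\nabla^2_{x_2 x_2} f_2\right)^{-1} \nabla_{x_2} f_1 ,
\end{equation*}
and therefore requires second-order information about the follower's payoff $f_2$. An uncoupled update, by contrast, lets each player $i$ adjust $x_i$ using only $\nabla_{x_i} f_i$ evaluated at (or estimated from) the other player's observed behavior.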