Repeated games consider situations where multiple agents are each motivated by their own independent rewards throughout learning. In general, the resulting learning dynamics are complex. In particular, when the agents' rewards conflict with each other, as in zero-sum games, the dynamics often fail to converge to the optimum, i.e., the Nash equilibrium. To tackle this complexity, many studies have analyzed various learning algorithms as dynamical systems and have uncovered qualitative insights into those algorithms. However, such studies have yet to handle multi-memory games, in which agents can memorize the actions they played in the past and choose their actions based on those memories, even though memorization plays a pivotal role in artificial intelligence and in interpersonal relationships. This study extends two major learning algorithms in games, i.e., replicator dynamics and gradient ascent, to multi-memory games, and then proves that their dynamics are identical. Furthermore, we show both theoretically and experimentally that the learning dynamics diverge from the Nash equilibrium in multi-memory zero-sum games and reach heteroclinic cycles, sojourning ever longer near the boundary of the strategy space, providing a fundamental advance in learning in games.
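As a rough illustration only, and not the paper's multi-memory formulation, the minimal sketch below simulates naive simultaneous gradient ascent on memoryless matching pennies, a zero-sum game whose unique Nash equilibrium is the pair of uniform strategies (0.5, 0.5). The discrete-time trajectory spirals away from the equilibrium toward the boundary of the strategy space, which is the classical memoryless analogue of the non-convergent behavior described above; the payoff parameterization, learning rate, and horizon are illustrative assumptions.

    import math

    # Matching pennies: p = P(player 1 plays Heads), q = P(player 2 plays Heads).
    # Player 1's expected payoff is u1(p, q) = (2p - 1)(2q - 1); player 2 receives -u1 (zero-sum).
    eta = 0.05        # learning rate (illustrative choice)
    p, q = 0.6, 0.4   # initial mixed strategies away from the Nash equilibrium (0.5, 0.5)

    for t in range(2001):
        grad_p = 2.0 * (2.0 * q - 1.0)     # du1/dp for player 1
        grad_q = -2.0 * (2.0 * p - 1.0)    # d(-u1)/dq for player 2
        # Simultaneous gradient ascent, projected back onto [0, 1]
        p = min(1.0, max(0.0, p + eta * grad_p))
        q = min(1.0, max(0.0, q + eta * grad_q))
        if t % 500 == 0:
            dist = math.hypot(p - 0.5, q - 0.5)
            print(f"t={t:5d}  p={p:.3f}  q={q:.3f}  distance from Nash={dist:.3f}")

Running this prints a distance from the Nash equilibrium that grows over time until the strategies reach the boundary of [0, 1], after which they keep cycling along it rather than settling at the equilibrium.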