For artificially intelligent learning systems to have widespread applicability in real-world settings, it is important that they be able to operate in a decentralized fashion. Unfortunately, decentralized control is difficult: computing even an epsilon-optimal joint policy is NEXP-complete. Nevertheless, a recently rediscovered insight -- that a team of agents can coordinate via common knowledge -- has given rise to algorithms capable of finding optimal joint policies in small common-payoff games. The Bayesian Action Decoder (BAD) leverages this insight and deep reinforcement learning to scale to games as large as two-player Hanabi. However, the approximations it employs to do so prevent it from discovering optimal joint policies even in games small enough to solve by brute force. This work proposes CAPI, a novel algorithm that, like BAD, combines common knowledge with deep reinforcement learning. Unlike BAD, however, CAPI prioritizes the ability to discover optimal joint policies over scalability. While this choice precludes CAPI from scaling to games as large as Hanabi, empirical results demonstrate that, on the games to which CAPI does scale, it discovers optimal joint policies even when other modern multi-agent reinforcement learning algorithms are unable to do so. Code is available at https://github.com/ssokota/capi .
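To give a concrete sense of what "games small enough to brute force optimal solutions" means, the following is a minimal sketch (not taken from the paper or its repository) of exhaustive search over deterministic joint policies in a tiny hypothetical two-player common-payoff game; the payoff tensor, observation distribution, and all names are illustrative assumptions.

```python
# Minimal sketch: brute-force search for an optimal joint policy in a tiny
# common-payoff game. With finitely many deterministic policies per agent,
# every joint policy can be enumerated and the best one kept.
# The game below (payoff tensor, observation distribution) is hypothetical.
from itertools import product

import numpy as np

# Shared payoff indexed by (obs_1, action_1, obs_2, action_2):
# 2 private observations and 3 actions per agent.
payoff = np.random.default_rng(0).integers(0, 10, size=(2, 3, 2, 3))
obs_probs = np.full((2, 2), 0.25)  # joint distribution over (obs_1, obs_2)


def expected_return(policy_1, policy_2):
    """Expected shared payoff of a deterministic joint policy (obs -> action)."""
    return sum(
        obs_probs[o1, o2] * payoff[o1, policy_1[o1], o2, policy_2[o2]]
        for o1, o2 in product(range(2), range(2))
    )


# Each agent's deterministic policy maps its 2 observations to one of 3 actions,
# so there are 3**2 = 9 policies per agent and 81 joint policies in total.
policies = list(product(range(3), repeat=2))
best = max(product(policies, policies), key=lambda jp: expected_return(*jp))
print("optimal joint policy:", best, "value:", expected_return(*best))
```

The number of joint policies grows doubly exponentially with horizon and observation count, which is why such enumeration is only feasible for very small games and why approaches like BAD and CAPI turn to deep reinforcement learning for anything larger.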