The standard problem setting in cooperative multi-agent learning is self-play (SP), where the goal is to train a team of agents that works well together. However, optimal SP policies commonly contain arbitrary conventions ("handshakes") and are not compatible with other, independently trained agents or with humans. This latter desideratum was recently formalized by Hu et al. (2020) as the zero-shot coordination (ZSC) setting and partially addressed with their Other-Play (OP) algorithm, which showed improved ZSC and human-AI performance in the card game Hanabi. OP assumes access to the symmetries of the environment and prevents agents from breaking these symmetries in mutually incompatible ways during training. However, as the authors point out, discovering the symmetries of a given environment is a computationally hard problem. Instead, we show that through a simple adaptation of k-level reasoning (KLR; Costa Gomes et al. 2006), synchronously training all levels, we can obtain competitive ZSC and ad-hoc team play performance in Hanabi, including when paired with a human-like proxy bot. We also introduce a new method, synchronous k-level reasoning with a best response (SyKLRBR), which further improves performance over synchronous KLR by co-training a best response.
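For concreteness, the sketch below illustrates the training scheme described above: a fixed random level-0 policy, higher levels each taking update steps against the current policy one level below (rather than training each level to convergence sequentially), and an additional best-response agent trained against partners sampled from all levels (SyKLRBR). The `Policy` class and `rl_update` helper are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch of synchronous k-level reasoning with a co-trained best response.
# `Policy` and `rl_update` are hypothetical stand-ins (assumptions, not the
# paper's codebase): in practice each policy would be e.g. a recurrent
# Q-network for Hanabi and rl_update would perform one RL training step.
import random

class Policy:
    """Stand-in for a trainable agent."""
    def __init__(self, name):
        self.name = name

def rl_update(policy, partner):
    """Placeholder for one update of `policy` best-responding to a frozen `partner`."""
    pass  # e.g. collect paired episodes and apply an RL loss to `policy`

K = 3                                              # highest reasoning level
level0 = Policy("level-0 (uniform random)")        # fixed, never updated
levels = [level0] + [Policy(f"level-{k}") for k in range(1, K + 1)]
best_response = Policy("best response")            # extra SyKLRBR agent

for step in range(10_000):
    # Synchronous KLR: each level k > 0 takes one step against the *current*
    # level k-1, so all levels improve together in a single training run.
    for k in range(1, K + 1):
        rl_update(levels[k], partner=levels[k - 1])
    # SyKLRBR: the best response trains against partners drawn from all levels.
    rl_update(best_response, partner=random.choice(levels))
```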