We consider the problem of zero-shot coordination: constructing AI agents that can coordinate with novel partners they have not seen before (e.g., humans). Standard Multi-Agent Reinforcement Learning (MARL) methods typically focus on the self-play (SP) setting, where agents construct strategies by playing the game with themselves repeatedly. Unfortunately, applying SP naively to the zero-shot coordination problem can produce agents that establish highly specialized conventions that do not carry over to novel partners they have not been trained with. We introduce a novel learning algorithm called other-play (OP) that enhances self-play by looking for more robust strategies, exploiting the presence of known symmetries in the underlying problem. We characterize OP theoretically as well as experimentally. We study the cooperative card game Hanabi and show that OP agents achieve higher scores when paired with independently trained agents. In preliminary results, we also show that our OP agents obtain higher average scores when paired with human players, compared to state-of-the-art SP agents.
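To make the contrast between SP and OP concrete, the following is a minimal sketch, not the authors' implementation. It assumes one plausible reading of "exploiting known symmetries": the partner's policy is evaluated under every symmetry relabeling of the actions, and the returns are averaged. The toy four-lever matching game, the payoff values, and all function and variable names are hypothetical and chosen only for illustration.

```python
import itertools

import numpy as np

# Hypothetical toy game: both players pick one of four levers and are paid only
# if they pick the same one. Levers 0-2 pay 1.0 and are interchangeable;
# lever 3 pays 0.9 and is distinguishable from the rest.
PAYOFF = np.diag([1.0, 1.0, 1.0, 0.9])

# Assumed known symmetries: every relabeling of the three interchangeable
# levers, with lever 3 held fixed. perm[a] is the new label of action a.
SYMMETRIES = [np.array(p + (3,)) for p in itertools.permutations(range(3))]


def expected_return(payoff, pi1, pi2):
    """Expected common payoff when the players act independently."""
    return float(pi1 @ payoff @ pi2)


def other_play_return(payoff, pi1, pi2, symmetries):
    """Average return when the partner's policy is relabeled by each symmetry."""
    total = 0.0
    for perm in symmetries:
        relabeled = np.zeros_like(pi2)
        relabeled[perm] = pi2  # probability of action a moves to perm[a]
        total += expected_return(payoff, pi1, relabeled)
    return total / len(symmetries)


# An arbitrary convention ("always pull lever 0") versus the symmetry-robust
# choice ("always pull lever 3").
convention = np.array([1.0, 0.0, 0.0, 0.0])
robust = np.array([0.0, 0.0, 0.0, 1.0])

sp_convention = expected_return(PAYOFF, convention, convention)
op_convention = other_play_return(PAYOFF, convention, convention, SYMMETRIES)
sp_robust = expected_return(PAYOFF, robust, robust)
op_robust = other_play_return(PAYOFF, robust, robust, SYMMETRIES)
```

In this toy game the convention scores 1.0 in self-play but only 1/3 once the partner's labels are shuffled, because an independently trained partner has no reason to settle on the same lever. The distinguishable lever keeps its 0.9 under both objectives, which is the kind of robust strategy the averaged objective favors.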