Modern multi-agent reinforcement learning frameworks rely on centralized training and reward shaping to perform well. However, centralized training and dense rewards are not readily available in the real world. Current multi-agent algorithms struggle to learn in the alternative setting of decentralized training or sparse rewards. To address these issues, we propose ELIGN (expectation alignment), a self-supervised intrinsic reward inspired by the self-organization principle in zoology. Similar to how animals collaborate in a decentralized manner with those in their vicinity, agents trained with expectation alignment learn behaviors that match their neighbors' expectations. This allows the agents to learn collaborative behaviors without any external reward or centralized training. We demonstrate the efficacy of our approach across six tasks in the multi-agent particle environment and the more complex Google Research Football environment, comparing ELIGN to sparse and curiosity-based intrinsic rewards. As the number of agents increases, ELIGN scales well in all multi-agent tasks except for one where agents have different capabilities. We show that agent coordination improves through expectation alignment because agents learn to divide tasks amongst themselves, break coordination symmetries, and confuse adversaries. These results identify tasks where expectation alignment is a more useful strategy than curiosity-driven exploration for multi-agent coordination, enabling agents to achieve zero-shot coordination.
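To make the expectation-alignment idea concrete, the sketch below illustrates one way such an intrinsic reward could be computed: each neighbor maintains a learned model that predicts the acting agent's effect on the neighbor's next observation, and the agent is rewarded for keeping that prediction error low. This is a minimal, hypothetical illustration; the `ExpectationModel` architecture, the `elign_style_reward` helper, and the negative-MSE reward shape are assumptions for exposition, not the paper's exact formulation.

```python
# Hypothetical sketch of an expectation-alignment-style intrinsic reward.
# Assumptions (not from the source): simple MLP expectation models, an
# intrinsic reward equal to the negative mean prediction error of neighbors.
import torch
import torch.nn as nn


class ExpectationModel(nn.Module):
    """Predicts a neighbor's next observation from its current observation
    and the acting agent's action (illustrative architecture)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, neighbor_obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([neighbor_obs, action], dim=-1))


def elign_style_reward(
    models: list[ExpectationModel],
    neighbor_obs: torch.Tensor,       # (num_neighbors, obs_dim) at time t
    neighbor_next_obs: torch.Tensor,  # (num_neighbors, obs_dim) at time t+1
    action: torch.Tensor,             # (act_dim,) acting agent's action at t
) -> float:
    """Reward the agent for behaving as its nearby neighbors expect,
    i.e. for keeping the neighbors' prediction error low."""
    errors = []
    for model, obs_t, obs_tp1 in zip(models, neighbor_obs, neighbor_next_obs):
        with torch.no_grad():
            predicted = model(obs_t, action)
        errors.append(torch.mean((predicted - obs_tp1) ** 2))
    # Low surprise to neighbors => high intrinsic reward.
    return -torch.stack(errors).mean().item()


if __name__ == "__main__":
    obs_dim, act_dim, num_neighbors = 8, 2, 3
    models = [ExpectationModel(obs_dim, act_dim) for _ in range(num_neighbors)]
    r_int = elign_style_reward(
        models,
        torch.randn(num_neighbors, obs_dim),
        torch.randn(num_neighbors, obs_dim),
        torch.randn(act_dim),
    )
    print(f"intrinsic reward: {r_int:.4f}")
```

Because each agent only consults models of agents in its vicinity, the reward can be computed without any centralized critic or external reward signal, which is the decentralized property the abstract emphasizes.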