In multi-agent reinforcement learning (MARL), independent learning (IL) often shows remarkable performance and easily scales with the number of agents. Yet, using IL can be inefficient and risks failing to train successfully, particularly in scenarios that require agents to coordinate their actions. Using centralised learning (CL) enables MARL agents to quickly learn how to coordinate their behaviour, but employing CL everywhere is often prohibitively expensive in real-world applications. Moreover, using CL in value-based methods often requires strong representational constraints (e.g. the individual-global-max condition) that can lead to poor performance if violated. In this paper, we introduce a novel plug & play IL framework named Multi-Agent Network Selection Algorithm (MANSA) which selectively employs CL only at states that require coordination. At its core, MANSA has an additional agent that uses switching controls to quickly learn the best states at which to activate CL during training, using CL only where necessary and vastly reducing the computational burden of CL. Our theory proves that MANSA preserves cooperative MARL convergence properties, boosts IL performance and can optimally make use of a fixed budget on the number of CL calls. We show empirically in Level-based Foraging (LBF) and the StarCraft Multi-agent Challenge (SMAC) that MANSA achieves fast, superior and more reliable performance while making 40% fewer CL calls in SMAC and using CL for only 1% of calls in LBF.
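To make the switching-control idea concrete, the sketch below shows one possible training loop in which a meta-agent decides, state by state, whether the learners perform an independent-learning (IL) update or a centralised-learning (CL) update under a fixed budget of CL calls. This is a minimal illustrative sketch only: all names (switch_policy, il_update, cl_update, DummyEnv) are hypothetical placeholders and the switching decision here is random, whereas MANSA learns this decision with switching controls.

```python
import random

class DummyEnv:
    """Toy stand-in environment so the sketch runs end to end (not part of MANSA)."""
    def reset(self):
        return 0
    def step(self):
        s = random.randint(0, 9)
        return s, s == 9  # (next state, episode done)

def switch_policy(state, remaining_cl_budget):
    """Placeholder switching controller: activates CL at random while budget remains.
    In MANSA this decision is itself learned, not random."""
    return remaining_cl_budget > 0 and random.random() < 0.1

def il_update(state):
    """Placeholder for a cheap, independent per-agent learning update."""
    pass

def cl_update(state):
    """Placeholder for an expensive centralised (joint) learning update."""
    pass

def train(env, num_steps, cl_budget):
    """Run the IL/CL switching loop and return how many CL calls were used."""
    state = env.reset()
    cl_calls = 0
    for _ in range(num_steps):
        if switch_policy(state, cl_budget - cl_calls):
            cl_update(state)   # coordinate agents jointly at this state
            cl_calls += 1
        else:
            il_update(state)   # independent updates everywhere else
        state, done = env.step()
        if done:
            state = env.reset()
    return cl_calls

if __name__ == "__main__":
    print("CL calls used:", train(DummyEnv(), num_steps=1000, cl_budget=50))
```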