The necessity for cooperation among intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in the artificial intelligence (AI) research community. However, many research efforts have focused on developing practical MARL algorithms whose effectiveness has been studied only empirically, and which therefore lack theoretical guarantees. As recent studies have revealed, MARL methods often achieve performance that is unstable in terms of reward monotonicity or suboptimal at convergence. To resolve these issues, in this paper we introduce a novel framework named Heterogeneous-Agent Mirror Learning (HAML) that provides a general template for MARL algorithmic designs. We prove that algorithms derived from the HAML template satisfy the desired properties of monotonic improvement of the joint reward and convergence to a Nash equilibrium. We verify the practicality of HAML by proving that the current state-of-the-art cooperative MARL algorithms, HATRPO and HAPPO, are in fact HAML instances. Next, as a natural outcome of our theory, we propose HAML extensions of two well-known RL algorithms, HAA2C (for A2C) and HADDPG (for DDPG), and demonstrate their effectiveness against strong baselines on StarCraft II and Multi-Agent MuJoCo tasks.