The necessity for cooperation among intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in AI research. However, many research endeavours rely heavily on parameter sharing among agents, which confines them to the homogeneous-agent setting and leads to training instability and a lack of convergence guarantees. To achieve effective cooperation in the general heterogeneous-agent setting, we propose Heterogeneous-Agent Reinforcement Learning (HARL) algorithms that resolve the aforementioned issues. Central to our findings are the multi-agent advantage decomposition lemma and the sequential update scheme. Based on these, we develop the provably correct Heterogeneous-Agent Trust Region Learning (HATRL), which is free of the parameter-sharing constraint, and derive HATRPO and HAPPO as its tractable approximations. Furthermore, we discover a novel framework named Heterogeneous-Agent Mirror Learning (HAML), which strengthens the theoretical guarantees for HATRPO and HAPPO and provides a general template for cooperative MARL algorithmic design. We prove that all algorithms derived from HAML inherently enjoy monotonic improvement of the joint reward and convergence to Nash equilibrium. As a natural outcome, HAML validates more novel algorithms in addition to HATRPO and HAPPO, including HAA2C, HADDPG, and HATD3, which consistently outperform their existing MA counterparts. We comprehensively test HARL algorithms on six challenging benchmarks and demonstrate their superior effectiveness and stability for coordinating heterogeneous agents, compared with strong baselines such as MAPPO and QMIX.
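To make the sequential update scheme concrete, below is a minimal sketch of one HAPPO-style training iteration in PyTorch. The agent and batch interfaces (`log_prob`, `optimizer`, the `batch` keys) are hypothetical stand-ins, and the advantage estimation is elided; this illustrates the general scheme rather than the authors' reference implementation.

```python
import numpy as np
import torch

def happo_sequential_update(agents, batch, clip_eps=0.2):
    """One sequential-update iteration, sketched under an assumed API.

    `agents` is a list of policies, each exposing log_prob(obs, act)
    and an `optimizer`; `batch` holds per-agent observations/actions,
    old-policy log-probs, and a shared joint advantage estimate.
    """
    # M compounds the importance ratios of agents already updated in
    # this iteration; it starts at 1 for the first agent in the order.
    M = torch.ones_like(batch["advantages"])
    # Agents update one at a time, in a randomly drawn order.
    for i in np.random.permutation(len(agents)):
        agent = agents[i]
        logp = agent.log_prob(batch["obs"][i], batch["acts"][i])
        ratio = torch.exp(logp - batch["old_logp"][i])
        # Each agent optimizes the joint advantage weighted by M, i.e.,
        # the advantage of acting on top of its predecessors' updates.
        weighted_adv = M * batch["advantages"]
        surrogate = torch.min(
            ratio * weighted_adv,
            torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * weighted_adv,
        )
        loss = -surrogate.mean()
        agent.optimizer.zero_grad()
        loss.backward()
        agent.optimizer.step()
        # Fold agent i's post-update ratio into M for the next agent.
        with torch.no_grad():
            new_logp = agent.log_prob(batch["obs"][i], batch["acts"][i])
            M = M * torch.exp(new_logp - batch["old_logp"][i])
```

The compounded weight `M` is what distinguishes this from running independent PPO updates: each agent optimizes against the policies its predecessors have already committed to, which is the mechanism behind the monotonic-improvement guarantee stated above.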