Trust region methods rigorously enabled reinforcement learning (RL) agents to learn monotonically improving policies, leading to superior performance on a variety of tasks. Unfortunately, when it comes to multi-agent reinforcement learning (MARL), the property of monotonic improvement may not simply apply; this is because agents, even in cooperative games, can have conflicting directions of policy updates. As a result, achieving guaranteed improvement of the joint policy when each agent updates its policy individually remains an open challenge. In this paper, we extend the theory of trust region learning to MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme. Based on these, we develop the Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms. Unlike many existing MARL algorithms, HATRPO/HAPPO do not require agents to share parameters, nor do they need any restrictive assumptions on the decomposability of the joint value function. Most importantly, we justify in theory the monotonic improvement property of HATRPO/HAPPO. We evaluate the proposed methods on a series of Multi-Agent MuJoCo and StarCraft II tasks. Results show that HATRPO and HAPPO significantly outperform strong baselines such as IPPO, MAPPO and MADDPG on all tested tasks, thereby establishing a new state of the art.
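To give a concrete feel for the sequential policy update scheme named above, the following is a minimal, hypothetical sketch in JAX. It is not the paper's implementation: the toy tabular softmax policies, the shapes (`N_AGENTS`, `N_ACT`, `BATCH`), the gradient-ascent loop, and all helper names are illustrative assumptions. The sketch only shows the general structure: agents are optimised one at a time against a PPO-style clipped surrogate, and the joint advantage seen by each later agent is reweighted by the probability ratios of the agents already updated.

```python
# Hypothetical sketch: sequential clipped-surrogate updates for N agents with
# toy tabular softmax policies. Names and shapes are illustrative only.
import jax
import jax.numpy as jnp

N_AGENTS, N_ACT, BATCH = 3, 4, 64
k_act, k_adv, k_perm = jax.random.split(jax.random.PRNGKey(0), 3)

def log_prob(logits, actions):
    # log pi(a | logits) for a batch of discrete actions
    return jax.nn.log_softmax(logits)[actions]

def clipped_surrogate(logits, actions, old_logp, weighted_adv, eps=0.2):
    # PPO-style clipped objective; `weighted_adv` already carries the
    # ratios of previously updated agents.
    ratio = jnp.exp(log_prob(logits, actions) - old_logp)
    return jnp.mean(jnp.minimum(ratio * weighted_adv,
                                jnp.clip(ratio, 1.0 - eps, 1.0 + eps) * weighted_adv))

# Toy data standing in for a sampled batch of joint experience.
params = [jnp.zeros(N_ACT) for _ in range(N_AGENTS)]            # per-agent logits
actions = jax.random.randint(k_act, (N_AGENTS, BATCH), 0, N_ACT)
advantage = jax.random.normal(k_adv, (BATCH,))                  # shared joint advantage
old_logps = jnp.stack([log_prob(params[i], actions[i]) for i in range(N_AGENTS)])

# Sequential scheme: update agents one at a time in a random order; each later
# agent sees the advantage reweighted by the ratios of already-updated agents.
grad_fn = jax.grad(clipped_surrogate)
carry = jnp.ones(BATCH)                                         # running product of ratios
for i in map(int, jax.random.permutation(k_perm, N_AGENTS)):
    for _ in range(10):                                         # a few ascent steps
        g = grad_fn(params[i], actions[i], old_logps[i], carry * advantage)
        params[i] = params[i] + 0.1 * g
    # fold this agent's final ratio into the weighting used by subsequent agents
    carry = carry * jnp.exp(log_prob(params[i], actions[i]) - old_logps[i])
```

The design point illustrated here is that no parameter sharing or value decomposition is assumed: each agent keeps its own parameters, and coordination enters only through the running ratio product that later agents fold into their surrogate objective.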