Developing reinforcement learning algorithms that satisfy safety constraints is becoming increasingly important in real-world applications. In multi-agent reinforcement learning (MARL) settings, safety-aware policy optimisation is particularly challenging because each individual agent must not only meet its own safety constraints but also account for those of the other agents, so that their joint behaviour can be guaranteed to be safe. Despite its importance, the problem of safe multi-agent learning has not been rigorously studied; very few solutions have been proposed, and no shareable testing environment or benchmark exists. To fill these gaps, in this work we formulate the safe MARL problem as a constrained Markov game and solve it with policy optimisation methods. Our solutions -- Multi-Agent Constrained Policy Optimisation (MACPO) and MAPPO-Lagrangian -- leverage theories from both constrained policy optimisation and multi-agent trust region learning. Crucially, our methods enjoy theoretical guarantees of both monotonic improvement in reward and satisfaction of safety constraints at every iteration. To examine the effectiveness of our methods, we develop the Safe Multi-Agent MuJoCo benchmark suite, which includes a variety of MARL baselines. Experimental results show that MACPO and MAPPO-Lagrangian consistently satisfy safety constraints while achieving performance comparable to strong baselines.
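For concreteness, a constrained Markov game of this kind can be written as a reward-maximisation problem subject to per-agent cost constraints; the display below is an illustrative sketch in our own shorthand ($J$, $C^{i}$, $c^{i}$, $d^{i}$, $\lambda_{i}$ are assumed notation, not necessarily the paper's), and the Lagrangian relaxation is the generic form such Lagrangian-based methods typically optimise rather than the exact update used here:
$$
\max_{\pi}\; J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t,\mathbf{a}_t)\Big]
\quad \text{s.t.} \quad
C^{i}(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\, c^{i}(s_t,\mathbf{a}_t)\Big] \le d^{i} \;\; \forall i,
$$
with the corresponding Lagrangian relaxation
$$
\min_{\lambda \ge 0}\; \max_{\pi}\; J(\pi) - \sum_{i} \lambda_{i}\big(C^{i}(\pi) - d^{i}\big),
$$
where each multiplier $\lambda_{i}$ is adapted to penalise violations of agent $i$'s cost budget $d^{i}$.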