While multi-agent trust region algorithms have achieved great empirical success in solving coordination tasks, most of them suffer from a non-stationarity problem because agents update their policies simultaneously. In contrast, a sequential scheme that updates policies agent-by-agent provides another perspective and shows strong performance. However, sample inefficiency and the lack of monotonic improvement guarantees for each agent remain two significant challenges for the sequential scheme. In this paper, we propose the \textbf{A}gent-by-\textbf{a}gent \textbf{P}olicy \textbf{O}ptimization (A2PO) algorithm, which improves sample efficiency while retaining a monotonic improvement guarantee for each agent during training. We justify that its monotonic improvement bound is tighter than those of other trust region algorithms. From the perspective of sequentially updating agents, we further analyze the effect of the agent updating order and extend the theory of non-stationarity to the sequential update scheme. To evaluate A2PO, we conduct a comprehensive empirical study on four benchmarks: StarCraft II, Multi-agent MuJoCo, Multi-agent Particle Environment, and Google Research Football full-game scenarios. A2PO consistently outperforms strong baselines.
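As an illustrative sketch of the sequential scheme referred to above (generic notation of our own, not the exact A2PO objective), suppose agents are updated in the order $i_1, \dots, i_n$ at iteration $k$. Agent $i_m$ is optimized only after agents $i_1, \dots, i_{m-1}$ have committed their new policies, subject to a trust-region constraint of radius $\delta$:
\begin{equation*}
% Generic sequential trust-region update (illustration only, not the paper's exact formulation):
% rho_{pi^k} is the state distribution under the previous joint policy,
% A^{pi^k} is a multi-agent advantage, and delta is the trust-region radius.
\pi_{i_m}^{k+1} \;=\; \operatorname*{arg\,max}_{\pi_{i_m}} \;
\mathbb{E}_{\,s \sim \rho_{\boldsymbol{\pi}^k},\; \mathbf{a}_{i_{1:m-1}} \sim \boldsymbol{\pi}^{k+1}_{i_{1:m-1}},\; a_{i_m} \sim \pi_{i_m}}
\Big[ A^{\boldsymbol{\pi}^k}\big(s, \mathbf{a}_{i_{1:m-1}}, a_{i_m}\big) \Big]
\quad \text{s.t.} \quad
\overline{D}_{\mathrm{KL}}\big(\pi^k_{i_m} \,\|\, \pi_{i_m}\big) \le \delta .
\end{equation*}
Under this scheme, each agent's update conditions on the already-updated policies of its predecessors, which is the source of both the potential per-agent improvement guarantee and the order-dependence analyzed in the paper.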