Proximal Policy Optimization (PPO) is a popular on-policy reinforcement learning algorithm, yet it is used far less than off-policy algorithms in multi-agent settings. This is often due to the belief that on-policy methods are significantly less sample efficient than their off-policy counterparts in multi-agent problems. In this work, we investigate Multi-Agent PPO (MAPPO), a variant of PPO specialized for multi-agent settings. Using a desktop machine with a single GPU, we show that MAPPO achieves surprisingly strong performance in three popular multi-agent testbeds: the particle-world environments, the StarCraft Multi-Agent Challenge, and the Hanabi challenge, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. In the majority of environments, we find that, compared to off-policy baselines, MAPPO achieves strong results while exhibiting comparable sample efficiency. Finally, through ablation studies, we identify the implementation and algorithmic factors that are most influential to MAPPO's practical performance.
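For readers unfamiliar with the objective referred to above, the sketch below illustrates a clipped PPO surrogate paired with a value loss, assuming the centralized-critic (centralized training, decentralized execution) formulation commonly associated with MAPPO. The function name `mappo_loss`, the tensor layout, and the coefficient defaults are illustrative assumptions, not the paper's reference implementation.

```python
import torch

def mappo_loss(logp_new, logp_old, advantages, values, returns,
               clip_eps=0.2, value_coef=0.5):
    """Illustrative sketch (not the paper's code): clipped PPO surrogate
    plus a squared-error value loss.

    logp_new / logp_old: per-agent action log-probs under the current /
    behavior policy; advantages: estimated per-agent advantages;
    values / returns: predictions and targets for a critic that, in the
    assumed centralized-critic setup, conditions on global state.
    """
    # Importance ratio between current and behavior policy.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping the ratio bounds the size of each policy update.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    value_loss = (returns - values).pow(2).mean()
    return policy_loss + value_coef * value_loss
```

In this sketch, the only multi-agent-specific choice is what the critic conditions on; the surrogate itself is standard PPO, consistent with the abstract's claim of no domain-specific algorithmic modifications.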