We present Coordinated Proximal Policy Optimization (CoPPO), an algorithm that extends the original Proximal Policy Optimization (PPO) to the multi-agent setting. The key idea lies in coordinating the adaptation of step sizes across multiple agents during the policy update process. We prove the monotonicity of policy improvement when optimizing a theoretically grounded joint objective, and derive a simplified optimization objective based on a set of approximations. We then show that this objective in CoPPO can achieve dynamic credit assignment among agents, thereby alleviating the high-variance issue arising from the concurrent update of agent policies. Finally, we demonstrate that CoPPO outperforms several strong baselines and is competitive with the latest multi-agent PPO method (i.e., MAPPO) in typical multi-agent settings, including cooperative matrix games and StarCraft II micromanagement tasks.
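To make the core idea concrete, below is a minimal illustrative sketch (not the paper's exact objective) of a coordinated PPO-style surrogate in which each agent's update is scaled by the clipped product of the other agents' policy ratios, so that step sizes adapt jointly rather than independently. All names, shapes, and the specific clipping structure here are assumptions for illustration only; the precise CoPPO objective and its approximations are derived in the paper.

```python
import torch

def coordinated_ppo_loss(log_probs_new, log_probs_old, advantages,
                         eps_own=0.2, eps_others=0.2):
    """Illustrative sketch of a coordinated, multi-agent PPO-style surrogate.

    Assumed inputs (hypothetical shapes, not from the paper):
      log_probs_new, log_probs_old: (batch, n_agents) per-agent action log-probs
      advantages: (batch,) a shared joint advantage estimate

    Idea illustrated: each agent's own ratio is combined with the clipped
    product of the other agents' ratios, coupling the effective step sizes
    of concurrent policy updates.
    """
    ratios = torch.exp(log_probs_new - log_probs_old)            # (batch, n_agents)
    joint = ratios.prod(dim=1, keepdim=True)                     # product over all agents
    others = joint / ratios                                       # product excluding agent i
    others_clipped = torch.clamp(others, 1 - eps_others, 1 + eps_others)

    adv = advantages.unsqueeze(1)                                 # broadcast over agents
    unclipped = ratios * others_clipped * adv
    clipped = torch.clamp(ratios, 1 - eps_own, 1 + eps_own) * others_clipped * adv
    # PPO-style pessimistic bound, averaged over agents and the batch
    return -torch.minimum(unclipped, clipped).mean()
```

As in single-agent PPO, the pessimistic minimum discourages overly large updates; the extra clipped product term is what makes one agent's effective step size depend on how much the other agents' policies have moved.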