We propose Multi-Agent Reflective Policy Optimization (MARPO) to alleviate sample inefficiency in multi-agent reinforcement learning. MARPO consists of two key components: a reflection mechanism that leverages subsequent trajectories to improve sample efficiency, and an asymmetric clipping mechanism, derived from the KL divergence, that dynamically adjusts the clipping range to improve training stability. We evaluate MARPO on classic multi-agent environments, where it consistently outperforms baseline methods.
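For intuition, the sketch below shows one way a PPO-style clipped surrogate can be made asymmetric and adapted using a measured KL estimate. The function name, the hyperparameters `base_eps` and `kl_target`, and the adjustment rule are our illustrative assumptions, not MARPO's actual KL-based derivation, which the abstract does not specify.

```python
import torch

def asymmetric_clip_loss(log_probs: torch.Tensor,
                         old_log_probs: torch.Tensor,
                         advantages: torch.Tensor,
                         base_eps: float = 0.2,
                         kl_target: float = 0.01) -> torch.Tensor:
    """PPO-style surrogate with an asymmetric, KL-adapted clipping range.

    base_eps, kl_target, and the adjustment rule below are illustrative
    placeholders, not the bounds MARPO derives from the KL divergence.
    """
    ratio = torch.exp(log_probs - old_log_probs)
    # Monte Carlo estimate of KL(old || new) from log-prob differences.
    kl = (old_log_probs - log_probs).mean().item()
    # Tighten the upper bound as the measured KL grows, leaving the lower
    # bound fixed -- one plausible way to make the clip range asymmetric.
    scale = min(max(kl_target / (abs(kl) + 1e-8), 0.5), 2.0)
    eps_low, eps_high = base_eps, base_eps * scale
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic PPO objective, negated for gradient descent.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Decoupling the lower and upper bounds lets the update tolerate ratio decreases and increases differently; tying the adaptation to a KL estimate is one way to keep the effective trust region stable as the policy drifts.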