One of the preeminent obstacles to scaling multi-agent reinforcement learning (MARL) to large numbers of agents is assigning credit to individual agents' actions. In this paper, we address this credit assignment problem with an approach that we call \textit{partial reward decoupling} (PRD), which attempts to decompose large cooperative MARL problems into decoupled subproblems involving subsets of agents, thereby simplifying credit assignment. We empirically demonstrate that decomposing the RL problem using PRD in an actor-critic algorithm yields lower-variance policy gradient estimates, which improve data efficiency, learning stability, and asymptotic performance relative to various other actor-critic approaches across a wide array of multi-agent RL tasks. Additionally, we relate our approach to counterfactual multi-agent policy gradient (COMA), a state-of-the-art MARL algorithm, and empirically show that our approach outperforms COMA by making better use of the information in agents' reward streams and by enabling recent advances in advantage estimation to be used.
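To sketch the decoupling idea in generic actor-critic notation (the symbols below, including the relevant set $\mathcal{R}_i$, per-agent rewards $r^j_k$, and baseline $b^i$, are illustrative placeholders rather than the paper's own notation), agent $i$'s policy gradient estimate could restrict the discounted return to the reward streams of the agents it is estimated to influence:
\[
\nabla_{\theta_i} J \;\approx\; \mathbb{E}\!\left[ \nabla_{\theta_i} \log \pi_{\theta_i}\!\left(a^i_t \mid s_t\right) \hat{A}^i_t \right],
\qquad
\hat{A}^i_t \;=\; \sum_{k \ge t} \gamma^{\,k-t} \sum_{j \in \mathcal{R}_i} r^j_k \;-\; b^i(s_t).
\]
Dropping reward terms from agents outside $\mathcal{R}_i$ removes contributions to the return that are unaffected by agent $i$'s action, which is the intuition behind the lower-variance gradient estimates claimed above.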