Many advances in cooperative multi-agent reinforcement learning (MARL) are based on two common design principles: value decomposition and parameter sharing. A typical MARL algorithm in this fashion decomposes a centralized Q-function into local Q-networks with parameters shared across agents. Such an algorithmic paradigm enables centralized training and decentralized execution (CTDE) and leads to efficient learning in practice. Despite these advantages, we revisit the two principles and show that in certain scenarios, e.g., environments with a highly multi-modal reward landscape, value decomposition and parameter sharing can be problematic and lead to undesired outcomes. In contrast, policy gradient (PG) methods with individual policies provably converge to an optimal solution in these cases, which partially supports recent empirical observations that PG can be effective in many MARL testbeds. Inspired by our theoretical analysis, we present practical suggestions on implementing multi-agent PG algorithms for achieving either high rewards or diverse emergent behaviors, and we empirically validate our findings on a variety of domains, ranging from simplified matrix and grid-world games to complex benchmarks such as the StarCraft Multi-Agent Challenge and Google Research Football. We hope our insights can benefit the community in developing more general and more powerful MARL algorithms. Check our project website at https://sites.google.com/view/revisiting-marl.
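To make the paradigm under discussion concrete, below is a minimal sketch of VDN-style additive value decomposition with parameter sharing: a single local Q-network is shared by all agents and the joint Q-value is the sum of per-agent Q-values. This is an illustrative assumption of the general CTDE setup, not the paper's implementation; names such as `SharedLocalQ` and `joint_q` are hypothetical.

```python
# Sketch of value decomposition + parameter sharing (VDN-style, assumed setup).
import torch
import torch.nn as nn


class SharedLocalQ(nn.Module):
    """Local Q-network whose parameters are shared across all agents."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, n_agents, obs_dim) -> per-agent Q-values (batch, n_agents, n_actions)
        return self.net(obs)


def joint_q(shared_q: SharedLocalQ, obs: torch.Tensor,
            actions: torch.Tensor) -> torch.Tensor:
    """Additive decomposition: Q_tot(o, a) = sum_i Q_i(o_i, a_i)."""
    q_all = shared_q(obs)                                           # (batch, n_agents, n_actions)
    q_taken = q_all.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # (batch, n_agents)
    return q_taken.sum(dim=-1)                                      # (batch,)


if __name__ == "__main__":
    batch, n_agents, obs_dim, n_actions = 32, 3, 10, 5
    shared_q = SharedLocalQ(obs_dim, n_actions)
    obs = torch.randn(batch, n_agents, obs_dim)
    actions = torch.randint(n_actions, (batch, n_agents))
    print(joint_q(shared_q, obs, actions).shape)  # torch.Size([32])
```

The key point is that all agents reuse one set of weights and the centralized target is fit through the summed local Q-values, which is exactly the coupling the paper's analysis questions in multi-modal reward landscapes.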