In cooperative multi-agent systems, agents act jointly and receive a single team reward rather than individual rewards. Because no individual reward signal is available, credit assignment mechanisms are usually introduced to distinguish the contributions of different agents and thereby enable effective cooperation. Recently, the value decomposition paradigm has been widely adopted to realize credit assignment, and QMIX has become the state-of-the-art solution. In this paper, we revisit QMIX from two aspects. First, we propose a new perspective on measuring credit assignment and empirically show that QMIX suffers from limited discriminability when assigning credit to agents. Second, we propose adding a gradient entropy regularization to QMIX to realize a discriminative credit assignment, thereby improving overall performance. Comparative experiments demonstrate that our approach improves learning efficiency and achieves better performance.
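The abstract does not define the credit measurement or the regularizer, so the following is only a minimal sketch of one plausible reading, not the method itself. It assumes that the credit of agent i is measured by the gradient of the mixed value Q_tot with respect to that agent's utility Q_i, that these gradients are non-negative (as a monotonic mixer such as QMIX would give), and that the regularizer is the Shannon entropy of the normalized gradients added to the TD loss. The function names, the weight `lam`, and the sign of the penalty are all hypothetical choices made for illustration.

```python
import math


def credit_distribution(grads):
    """Normalize per-agent gradients dQ_tot/dQ_i into a probability distribution.

    Assumption: gradients are non-negative, as with a monotonic mixing network.
    """
    if any(g < 0 for g in grads):
        raise ValueError("expected non-negative gradients from a monotonic mixer")
    total = sum(grads)
    if total == 0:
        # No gradient signal at all: treat credit as evenly spread.
        return [1.0 / len(grads)] * len(grads)
    return [g / total for g in grads]


def gradient_entropy(grads):
    """Shannon entropy of the normalized gradients.

    High entropy: credit is spread almost evenly (low discriminability).
    Low entropy: credit is concentrated on a few agents.
    """
    p = credit_distribution(grads)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def regularized_loss(td_loss, grads, lam=0.1):
    """Hypothetical objective: TD loss plus an entropy penalty weighted by lam."""
    return td_loss + lam * gradient_entropy(grads)


# Illustrative gradients for four agents (made-up numbers, not results).
uniform = [1.0, 1.0, 1.0, 1.0]
skewed = [3.7, 0.1, 0.1, 0.1]

print(gradient_entropy(uniform))  # equals log(4), the maximum for four agents
print(gradient_entropy(skewed))   # strictly smaller than the uniform case
print(regularized_loss(1.0, uniform), regularized_loss(1.0, skewed))
```

Under these assumptions, minimizing the penalized loss pushes the normalized gradients away from the uniform distribution, which is one way a regularizer could encourage more discriminative credit assignment. In a real training loop the gradients would come from automatic differentiation through the mixing network rather than from hand-written lists.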