In this paper, we propose a new maximum mutual information (MMI) framework for multi-agent reinforcement learning (MARL) that enables multiple agents to learn coordinated behaviors by regularizing the accumulated return with the mutual information between simultaneous multi-agent actions. By introducing a latent variable to induce nonzero mutual information between multi-agent actions and applying a variational bound, we derive a tractable lower bound on the considered MMI-regularized objective function. The derived tractable objective can be interpreted as maximum entropy reinforcement learning combined with uncertainty reduction of other agents' actions. Applying policy iteration to maximize the derived lower bound, we propose a practical algorithm named variational maximum mutual information multi-agent actor-critic (VM3-AC), which follows the centralized-learning-with-decentralized-execution paradigm. We evaluated VM3-AC on several games requiring coordination, and numerical results show that VM3-AC outperforms other MARL algorithms in multi-agent tasks requiring high-quality coordination.
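As a rough sketch of the two central quantities summarized above, written in generic notation (the coefficient \(\alpha\), the variational distribution \(q_{\xi}\), and the exact conditioning variables are illustrative assumptions rather than the paper's precise definitions), the MMI-regularized objective augments the return with the mutual information between simultaneous actions,
\[
J_{\mathrm{MMI}}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t}\gamma^{t}\Big(r_{t} \;+\; \alpha \sum_{i\neq j} I\big(a_{t}^{i};\, a_{t}^{j}\mid s_{t}\big)\Big)\right],
\]
and a standard Barber--Agakov-style variational bound replaces each mutual-information term with a tractable surrogate,
\[
I\big(a^{i};\, a^{j}\mid s\big) \;\geq\; \mathcal{H}\big(a^{i}\mid s\big) \;+\; \mathbb{E}\big[\log q_{\xi}\big(a^{i}\mid s, a^{j}\big)\big],
\]
whose first term recovers the entropy bonus of maximum entropy reinforcement learning and whose second term rewards reducing uncertainty about other agents' actions.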