Information sharing is key to building team cognition and enables coordination and cooperation. High-performing human teams also benefit from acting strategically with hierarchical levels of iterated communication and rationalizability, meaning a human agent can reason about the actions of their teammates in their decision-making. Yet, the majority of prior work in Multi-Agent Reinforcement Learning (MARL) does not support iterated rationalizability and only encourages inter-agent communication, resulting in a suboptimal equilibrium cooperation strategy. In this work, we show that reformulating an agent's policy to be conditional on the policies of its neighboring teammates inherently maximizes a lower bound on Mutual Information (MI) when optimizing under Policy Gradient (PG). Building on the ideas of decision-making under bounded rationality and cognitive hierarchy theory, we show that our modified PG approach not only maximizes local agent rewards but also implicitly reasons about MI between agents, without the need for any explicit ad-hoc regularization terms. Our approach, InfoPG, outperforms baselines in learning emergent collaborative behaviors and sets the state of the art in decentralized cooperative MARL tasks. Our experiments validate the utility of InfoPG by achieving higher sample efficiency and significantly larger cumulative reward in several complex cooperative multi-agent domains.
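As an illustrative sketch (using our own notation, not necessarily the paper's), the kind of variational lower bound on MI that a conditional-policy formulation can maximize is the standard Barber--Agakov bound: for agents $i$ and $j$ with actions $a^i, a^j$ and any variational conditional $q_\theta$,
\[
I(a^i; a^j) \;=\; \mathbb{E}_{p(a^i, a^j)}\!\left[\log \frac{p(a^i \mid a^j)}{p(a^i)}\right] \;\ge\; \mathbb{E}_{p(a^i, a^j)}\!\left[\log q_\theta(a^i \mid a^j)\right] + \mathcal{H}(a^i).
\]
Under the assumption that agent $i$'s policy is parameterized directly as such a conditional, $\pi_\theta^i(a^i \mid o^i, \pi^j)$, improving this conditional via PG updates tightens the bound above; the precise form of the bound and conditioning used by InfoPG is given in the main text.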