Discovering successful coordinated behaviors is a central challenge in Multi-Agent Reinforcement Learning (MARL), since it requires exploring a joint action space that grows exponentially with the number of agents. In this paper, we propose a mechanism for achieving sufficient exploration and coordination in a team of agents. Specifically, agents are rewarded for contributing to a more diversified team behavior through suitable intrinsic motivation functions. To learn meaningful coordination protocols, we structure agents' interactions in a novel framework where, at each timestep, an agent simulates counterfactual rollouts of its policy and, through a sequence of computations, assesses the gap between other agents' current behaviors and their targets. Actions that minimize this gap are considered highly influential and are rewarded. We evaluate our approach on a set of challenging tasks with sparse rewards and partial observability, such as the StarCraft Multi-Agent Challenge, that require learning complex cooperative strategies under a proper exploration scheme. Our method shows significantly improved performance over several baselines across all tasks.
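To make the counterfactual-influence idea above concrete, the following is a minimal, self-contained sketch (in Python/NumPy) of one way such an intrinsic reward could be computed: candidate actions are rolled out counterfactually, the gap between teammates' predicted behaviors and their target behaviors is measured, and the gap reduction achieved by the chosen action is returned as an intrinsic bonus. The function names (`behavior_gap`, `simulate_counterfactual`, `influence_reward`), the toy linear policy, and the uniform target behaviors are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def behavior_gap(observed: np.ndarray, target: np.ndarray) -> float:
    """Distance between teammates' predicted action distributions and their targets."""
    return float(np.linalg.norm(observed - target))

def simulate_counterfactual(policy, state, action, n_agents):
    """Stand-in counterfactual rollout: predict teammates' action distributions if
    `action` were taken. A real implementation would roll learned models forward."""
    rng = np.random.default_rng(action)  # deterministic per counterfactual action
    logits = policy(state) + 0.1 * rng.standard_normal((n_agents, policy(state).shape[-1]))
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def influence_reward(policy, state, actions, target_behaviors, n_agents):
    """Intrinsic reward: how much the most influential (gap-minimizing) action closes
    the gap to target behaviors, relative to the average counterfactual action."""
    gaps = np.array([
        behavior_gap(simulate_counterfactual(policy, state, a, n_agents), target_behaviors)
        for a in actions
    ])
    best_action = actions[int(np.argmin(gaps))]
    return best_action, float(gaps.mean() - gaps.min())

# Toy usage with a random linear "policy" (purely illustrative)
n_agents, n_actions, obs_dim = 3, 4, 8
W = np.random.default_rng(0).standard_normal((obs_dim, n_actions))
policy = lambda s: (s @ W) * np.ones((n_agents, 1))
state = np.random.default_rng(1).standard_normal(obs_dim)
targets = np.full((n_agents, n_actions), 1.0 / n_actions)  # e.g., diverse/uniform targets
best_action, r_int = influence_reward(policy, state, list(range(n_actions)), targets, n_agents)
print(best_action, round(r_int, 4))
```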