Current approaches to multi-agent cooperation rely heavily on centralized mechanisms or explicit communication protocols to ensure convergence. This paper studies the problem of distributed multi-agent learning without resorting to explicit coordination schemes. The proposed algorithm, DM$^2$, leverages distribution matching to facilitate coordination among independent agents. Each agent matches a target distribution of concurrently sampled trajectories from a joint expert policy. The theoretical analysis shows that, under certain conditions, if each agent optimizes its individual distribution matching objective, the agents jointly increase a lower bound on the objective of matching the joint expert policy, allowing convergence to the joint expert policy. Further, if the distribution matching objective is aligned with a joint task, combining the environment reward with the distribution matching reward leads to the same equilibrium. Experimental validation on the StarCraft domain shows that combining the distribution matching reward with the environment reward allows agents to outperform a fully distributed baseline. Additional experiments probe the conditions under which expert demonstrations must be sampled in order to outperform the fully distributed baseline.
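To make the combined objective concrete, the following is a minimal sketch of how a per-agent reward could be formed; the mixing weight $\lambda$ and the GAIL-style discriminator $D_i$ are illustrative assumptions, not notation taken from the paper:
$$
r_i\big(s_t^i, a_t^i\big) \;=\; r^{\text{env}}(s_t, a_t) \;+\; \lambda \, \log D_i\big(s_t^i, a_t^i\big),
$$
where $r^{\text{env}}$ is the environment reward and the $\log D_i$ term rewards agent $i$ for producing state-action pairs that the discriminator judges to come from that agent's marginal of the concurrently sampled expert trajectories.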