This paper introduces Multi-Agent MDP Homomorphic Networks, a class of networks that allows distributed execution using only local information, yet is able to share experience between global symmetries in the joint state-action space of cooperative multi-agent systems. In cooperative multi-agent systems, complex symmetries arise between different configurations of the agents and their local observations. For example, consider a group of agents navigating: rotating the state globally results in a permutation of the optimal joint policy. Existing work on symmetries in single agent reinforcement learning can only be generalized to the fully centralized setting, because such approaches rely on the global symmetry in the full state-action spaces, and these can result in correspondences across agents. To encode such symmetries while still allowing distributed execution we propose a factorization that decomposes global symmetries into local transformations. Our proposed factorization allows for distributing the computation that enforces global symmetries over local agents and local interactions. We introduce a multi-agent equivariant policy network based on this factorization. We show empirically on symmetric multi-agent problems that distributed execution of globally symmetric policies improves data efficiency compared to non-equivariant baselines.
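As a concrete illustration of the symmetry the abstract describes, the following minimal sketch (not the paper's network; all names such as `toy_joint_policy` and `rotate90` are hypothetical) checks that a hand-coded navigation policy is equivariant: applying a global 90-degree rotation to the state produces the correspondingly rotated joint actions.

```python
# Minimal sketch of the rotation symmetry in cooperative navigation.
# This is an illustrative toy, not the Multi-Agent MDP Homomorphic Network itself.
import numpy as np

# 90-degree rotation acting on 2D positions.
ROT90 = np.array([[0.0, -1.0],
                  [1.0,  0.0]])

def rotate90(points):
    """Apply the global rotation to every agent's 2D coordinates."""
    return points @ ROT90.T

def toy_joint_policy(positions, goal):
    """Hypothetical 'policy': each agent moves straight toward the goal.
    Because it depends only on relative geometry, it is equivariant by
    construction: policy(R s) = R policy(s)."""
    return goal - positions  # one 2D action per agent

# Equivariance check on random data.
rng = np.random.default_rng(0)
positions = rng.normal(size=(3, 2))   # 3 agents, 2D positions
goal = rng.normal(size=(1, 2))

lhs = toy_joint_policy(rotate90(positions), rotate90(goal))  # rotate the state, then act
rhs = rotate90(toy_joint_policy(positions, goal))            # act, then rotate the actions

assert np.allclose(lhs, rhs), "policy is not equivariant"
print("global rotation of the state maps to the matching rotation of the joint actions")
```

In this toy, each agent's action depends only on its own position and the goal, so the globally symmetric policy can be executed in a distributed way from local information, which is the property the proposed factorization is designed to preserve in learned networks.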