Multi-agent RL is rendered difficult by the non-stationarity of the environment as perceived by individual agents. Theoretically sound methods using the REINFORCE estimator are impeded by its high variance, whereas value-function-based methods suffer from issues stemming from their ad hoc handling of situations such as inter-agent communication. Methods such as MADDPG are further constrained by their requirement of centralized critics. To address these issues, we present MA-Dreamer, a model-based method that uses both agent-centric and global differentiable models of the environment to train decentralized agents' policies and critics from model rollouts, i.e. `imagination'. Since only the model training is done off-policy, inter-agent communication/coordination and `language emergence' can be handled in a straightforward manner. We compare the performance of MA-Dreamer with other methods on two soccer-based games. Our experiments show that in long-term speaker-listener tasks and in cooperative games with strong partial observability, MA-Dreamer finds a solution that makes effective use of coordination, whereas competing methods obtain only marginal scores or fail outright, respectively. By effectively achieving coordination and communication under more relaxed and general conditions, our method opens the door to the study of more complex problems and population-based training.