Multi-agent reinforcement learning (MARL) enables us to create adaptive agents in challenging environments, even when the agents have only limited observations. Modern MARL methods have hitherto focused on finding factorized value functions. While this approach has proven successful, the resulting methods have convoluted network structures. We take a radically different approach and build on the structure of independent Q-learners. Inspired by influence-based abstraction, we start from the observation that compact representations of the observation-action histories can be sufficient to learn close-to-optimal decentralized policies. Combining this observation with a dueling architecture, our algorithm, LAN, represents these policies as separate individual advantage functions w.r.t. a centralized critic. These local advantage networks condition only on a single agent's local observation-action history. The centralized value function conditions on the agents' representations as well as the full state of the environment. The value function, which is cast aside before execution, serves as a stabilizer that coordinates the learning and is used to formulate DQN targets during training. In contrast with other methods, this enables LAN to keep the number of parameters of its centralized network independent of the number of agents, without imposing additional constraints such as monotonic value functions. When evaluated on the StarCraft Multi-Agent Challenge benchmark, LAN shows state-of-the-art performance and achieves win rates of more than 80% on the two previously unsolved maps `corridor' and `3s5z_vs_3s6z', leading to an improvement of 10% over QPLEX in average performance across the 14 maps. Moreover, when the number of agents becomes large, LAN uses significantly fewer parameters than QPLEX or even QMIX. We thus show that LAN's structure forms a key improvement that helps MARL methods remain scalable.
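To make the described architecture concrete, the following is a minimal PyTorch-style sketch of how the two components could be wired together. It is not the authors' implementation: the GRU history encoder, the mean-pooling used to aggregate agent representations inside the centralized value (so that its parameter count stays independent of the number of agents), the class names `LocalAdvantageNetwork` and `CentralizedValue`, and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LocalAdvantageNetwork(nn.Module):
    """Per-agent advantage head conditioning only on that agent's
    local observation-action history, compactly encoded by a GRU."""

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.GRUCell(obs_dim, hidden_dim)
        self.adv_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, h_prev):
        h = self.encoder(obs, h_prev)      # compact history representation h_i
        return self.adv_head(h), h         # local advantages A_i(h_i, .) and new hidden state


class CentralizedValue(nn.Module):
    """Centralized value V(s, h_1..h_n), used only to stabilize training and to
    build DQN-style targets; it is discarded at execution time. Agent
    representations are mean-pooled here (an assumed aggregation choice) so the
    parameter count does not grow with the number of agents."""

    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + hidden_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, agent_hiddens):
        pooled = torch.stack(agent_hiddens, dim=0).mean(dim=0)
        return self.net(torch.cat([state, pooled], dim=-1))


# Training-time per-agent values used in the DQN-style target:
#   Q_i(s, h_i, a_i) ~ V(s, h_1..h_n) + A_i(h_i, a_i)
# At execution, each agent acts greedily w.r.t. its own local advantages A_i alone.
```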