Decentralized execution is a core requirement in cooperative multi-agent reinforcement learning (MARL). Recently, most popular MARL algorithms have adopted decentralized policies to enable decentralized execution and use gradient descent as their optimizer. However, there is little theoretical analysis of these algorithms that takes the optimization method into account, and we find that various popular MARL algorithms with decentralized policies are suboptimal in toy tasks when gradient descent is chosen as the optimization method. In this paper, we theoretically analyze two common classes of algorithms with decentralized policies -- multi-agent policy gradient methods and value-decomposition methods -- and prove their suboptimality when gradient descent is used. In addition, we propose the Transformation And Distillation (TAD) framework, which reformulates a multi-agent MDP as a special single-agent MDP with a sequential structure and enables decentralized execution by distilling the policy learned on the derived ``single-agent'' MDP. This two-stage learning paradigm addresses the optimization problem in cooperative MARL while preserving the performance guarantee. Empirically, we implement TAD-PPO based on PPO; it can theoretically perform optimal policy learning in finite multi-agent MDPs and significantly outperforms existing methods on a large set of cooperative multi-agent tasks.
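To make the two-stage paradigm concrete, the following is a minimal conceptual sketch of the transformation and distillation steps described above, not the paper's implementation. It assumes discrete actions and access to the joint observation in stage 1, and all class and function names (SequentialPolicy, DecentralizedPolicy, distill_step) are hypothetical illustrations.

```python
# Minimal sketch of the two-stage TAD paradigm (assumptions: discrete actions,
# flat joint observation in stage 1; all names below are hypothetical).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SequentialPolicy(nn.Module):
    """Stage 1 (Transformation): a centralized policy on the derived
    "single-agent" MDP, where agents decide one at a time and agent i
    conditions on the joint observation plus the actions of agents 1..i-1."""

    def __init__(self, obs_dim, n_agents, n_actions, hidden=128):
        super().__init__()
        self.n_agents, self.n_actions = n_agents, n_actions
        in_dim = n_agents * obs_dim + n_agents * n_actions
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, joint_obs):
        # joint_obs: (batch, n_agents, obs_dim)
        flat_obs = joint_obs.flatten(1)
        prev_actions = joint_obs.new_zeros(flat_obs.shape[0],
                                           self.n_agents * self.n_actions)
        actions, logits_per_agent = [], []
        for i in range(self.n_agents):
            logits = self.net(torch.cat([flat_obs, prev_actions], dim=-1))
            a = torch.distributions.Categorical(logits=logits).sample()
            # record agent i's action so later agents can condition on it
            prev_actions = prev_actions.clone()
            prev_actions[:, i * self.n_actions:(i + 1) * self.n_actions] = \
                F.one_hot(a, self.n_actions).float()
            actions.append(a)
            logits_per_agent.append(logits)
        return torch.stack(actions, dim=1), logits_per_agent


class DecentralizedPolicy(nn.Module):
    """Stage 2 (Distillation): one per-agent policy that sees only its
    local observation, enabling decentralized execution."""

    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, local_obs):
        return self.net(local_obs)


def distill_step(teacher_logits, student, local_obs, optimizer):
    """One distillation update: minimize the KL divergence from the
    centralized sequential policy's per-agent action distribution to
    the decentralized student's distribution."""
    loss = F.kl_div(F.log_softmax(student(local_obs), dim=-1),
                    F.softmax(teacher_logits.detach(), dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a TAD-PPO-style instantiation, stage 1 would train the sequential policy with standard single-agent PPO on the transformed MDP (the PPO loss itself is omitted here), and stage 2 would run distill_step for each agent on data collected from that policy; the sketch only illustrates the sequential decision structure and the distillation objective.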