In cooperative multi-agent reinforcement learning (MARL), combining value decomposition with actor-critic methods enables agents to learn stochastic policies, which are better suited to partially observable environments. Given the goal of learning local policies for decentralized execution, agents are commonly assumed to be independent of one another, even during centralized training. However, this assumption may prevent agents from learning the optimal joint policy. To address this problem, we explicitly incorporate the dependency among agents into centralized training. Although this leads to the optimal joint policy, it may not be factorizable for decentralized execution. Nevertheless, we theoretically show that from such a joint policy, we can always derive another joint policy that achieves the same optimality but can be factorized for decentralized execution. To this end, we propose multi-agent conditional policy factorization (MACPF), which takes more centralized training but still enables decentralized execution. We empirically evaluate MACPF on various cooperative MARL tasks and demonstrate that it achieves better performance or faster convergence than baselines. Our code is available at https://github.com/PKU-RL/FOP-DMAC-MACPF.