Recent success in cooperative multi-agent reinforcement learning (MARL) relies on centralized training and policy sharing. Centralized training eliminates the issue of non-stationarity in MARL but induces large communication costs, while policy sharing is empirically crucial to efficient learning in certain tasks yet lacks theoretical justification. In this paper, we formally characterize a subclass of cooperative Markov games in which agents exhibit a certain form of homogeneity such that policy sharing provably incurs no suboptimality. This enables us to develop the first consensus-based decentralized actor-critic method in which the consensus update is applied to both the actors and the critics while ensuring convergence. We also develop practical algorithms based on our decentralized actor-critic method that reduce the communication cost during training while still yielding policies comparable to those obtained with centralized training.
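To make the consensus mechanism concrete, the sketch below illustrates a generic gossip-averaging step over the agents' actor and critic parameter vectors, the kind of update the abstract refers to. It is a minimal illustration under stated assumptions, not the paper's exact algorithm: the mixing matrix C, the ring communication graph, and the parameter dimensions are hypothetical choices for demonstration.

```python
# Minimal sketch of a consensus (gossip-averaging) step applied to each agent's
# actor and critic parameters. Assumptions: a fixed doubly stochastic mixing
# matrix C aligned with the communication graph; parameters stored as one flat
# vector per agent. Illustrative only, not the paper's exact method.
import numpy as np

def consensus_step(params, C):
    """Mix each agent's parameter vector with its neighbors'.

    params: (n_agents, dim) array, one parameter vector per agent.
    C:      (n_agents, n_agents) doubly stochastic mixing matrix whose
            nonzero entries follow the communication graph.
    Returns the mixed parameters C @ params.
    """
    return C @ params

# Example: 3 agents on a ring graph with equal-weight mixing and self-loops.
C = np.array([[0.5 , 0.25, 0.25],
              [0.25, 0.5 , 0.25],
              [0.25, 0.25, 0.5 ]])
actor_params  = np.random.randn(3, 8)   # one actor parameter vector per agent
critic_params = np.random.randn(3, 8)   # one critic parameter vector per agent

# In a consensus-based decentralized actor-critic scheme, each agent would
# alternate local gradient updates with a consensus step applied to both
# the actor and the critic parameters.
actor_params  = consensus_step(actor_params, C)
critic_params = consensus_step(critic_params, C)
```

Repeated application of such a mixing step drives the agents' parameters toward a common value, which is what allows a shared policy to emerge without a central trainer; the communication cost per step scales with the number of graph edges rather than requiring all-to-all exchange.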