Structured Diversification Emergence via Reinforced Organization Control and Hierarchical Consensus Learning

When solving a complex task, humans spontaneously form teams, each completing a different part of the whole task, and cooperation between teammates improves efficiency. In current cooperative multi-agent reinforcement learning (MARL) methods, however, the cooperating team is constructed either through heuristics or through end-to-end black-box optimization. To improve the efficiency of cooperation and exploration, we propose Rochico, a structured diversification emergence MARL framework based on reinforced organization control and hierarchical consensus learning. Rochico first learns an adaptive grouping policy through the organization control module, which is built on independent multi-agent reinforcement learning. After team formation, a hierarchical consensus module is introduced, based on hierarchical intentions with a consensus constraint. Using the hierarchical consensus module together with a decision module enhanced by a self-supervised intrinsic reward, Rochico then outputs the final diversified multi-agent cooperative policy. The three modules are combined to promote the emergence of structured diversification. Comparative experiments on four large-scale cooperation tasks show that Rochico significantly outperforms current state-of-the-art algorithms in exploration efficiency and cooperation strength.
