Cooperative multi-agent reinforcement learning (MARL) has made remarkable progress in recent years. For training efficiency and scalability, most MARL algorithms make all agents share the same policy or value network. However, many complex multi-agent tasks require different agents to possess specific abilities for handling different subtasks. In such scenarios, sharing parameters indiscriminately may lead to similar behavior across all agents, which limits exploration efficiency and degrades the final performance. To balance training complexity and the diversity of agent behavior, we propose a novel framework that learns dynamic subtask assignment (LDSA) in cooperative MARL. Specifically, we first introduce a subtask encoder that constructs a vector representation for each subtask according to its identity. To assign agents to subtasks reasonably, we propose an ability-based subtask selection strategy that dynamically groups agents with similar abilities into the same subtask. In this way, agents dealing with the same subtask share their learning of specific abilities, and different subtasks correspond to different specific abilities. We further introduce two regularizers: one increases the representational difference between subtasks, and the other stabilizes training by discouraging agents from frequently switching subtasks. Empirical results show that LDSA learns a reasonable and effective subtask assignment for better collaboration and significantly improves learning performance on the challenging StarCraft II micromanagement benchmark and Google Research Football.
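To make the pipeline concrete, the sketch below illustrates the three components named above (subtask encoder, ability-based selection, and the two regularizers) in PyTorch. It is a minimal sketch under stated assumptions, not the authors' implementation: all module names, dimensions, the dot-product scoring, the cosine-similarity diversity penalty, and the KL-based switching penalty are illustrative choices consistent with the abstract's description.

```python
# Minimal, illustrative sketch of the LDSA components described in the
# abstract. All names, dimensions, and loss forms are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SubtaskEncoder(nn.Module):
    """Maps each subtask's one-hot identity to a vector representation."""

    def __init__(self, n_subtasks: int, repr_dim: int):
        super().__init__()
        self.n_subtasks = n_subtasks
        self.encoder = nn.Sequential(
            nn.Linear(n_subtasks, repr_dim),
            nn.ReLU(),
            nn.Linear(repr_dim, repr_dim),
        )

    def forward(self) -> torch.Tensor:
        ids = torch.eye(self.n_subtasks)      # one-hot subtask identities
        return self.encoder(ids)              # (n_subtasks, repr_dim)


class AbilityBasedSelector(nn.Module):
    """Scores each agent's trajectory embedding against the subtask
    representations; agents with similar abilities get similar selection
    distributions and thus tend to be grouped into the same subtask."""

    def __init__(self, traj_dim: int, repr_dim: int):
        super().__init__()
        self.ability = nn.Linear(traj_dim, repr_dim)  # agent "ability" embedding

    def forward(self, traj_emb: torch.Tensor,
                subtask_repr: torch.Tensor) -> torch.Tensor:
        ability = self.ability(traj_emb)              # (n_agents, repr_dim)
        logits = ability @ subtask_repr.t()           # (n_agents, n_subtasks)
        return F.softmax(logits, dim=-1)              # selection distribution


def diversity_loss(subtask_repr: torch.Tensor) -> torch.Tensor:
    """Regularizer 1 (illustrative): push subtask representations apart
    by penalizing pairwise cosine similarity between them."""
    z = F.normalize(subtask_repr, dim=-1)
    sim = z @ z.t()
    off_diag = sim - torch.diag_embed(torch.diagonal(sim))
    return off_diag.abs().mean()


def consistency_loss(p_now: torch.Tensor, p_prev: torch.Tensor) -> torch.Tensor:
    """Regularizer 2 (illustrative): discourage agents from frequently
    switching subtasks by keeping consecutive selection distributions close."""
    return F.kl_div(p_now.log(), p_prev, reduction="batchmean")


if __name__ == "__main__":
    n_agents, n_subtasks, traj_dim, repr_dim = 5, 4, 32, 16
    enc = SubtaskEncoder(n_subtasks, repr_dim)
    sel = AbilityBasedSelector(traj_dim, repr_dim)

    subtask_repr = enc()
    traj = torch.randn(n_agents, traj_dim)    # stand-in trajectory embeddings
    p = sel(traj, subtask_repr)
    p_prev = torch.full_like(p, 1.0 / n_subtasks)  # previous-step distribution

    reg = diversity_loss(subtask_repr) + 0.1 * consistency_loss(p, p_prev)
    assignment = p.argmax(dim=-1)             # greedy subtask per agent
    print(assignment, reg.item())
```

In an actual training loop these regularizers would be added to the main MARL objective (e.g., a value-factorization TD loss), with the trajectory embeddings produced by each agent's recurrent encoder.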