LDSA: Learning Dynamic Subtask Assignment in Cooperative Multi-Agent Reinforcement Learning

Cooperative multi-agent reinforcement learning (MARL) has made prominent progress in recent years. For training efficiency and scalability, most MARL algorithms make all agents share the same policy or value network. However, many complex multi-agent tasks require agents with a variety of specific abilities to handle different subtasks. Sharing parameters indiscriminately may lead to similar behaviors across all agents, which limits exploration efficiency and degrades final performance. To balance training complexity against the diversity of agents' behaviors, we propose a novel framework for learning dynamic subtask assignment (LDSA) in cooperative MARL. Specifically, we first introduce a subtask encoder that constructs a vector representation for each subtask according to its identity. To assign agents to subtasks reasonably, we propose an ability-based subtask selection strategy, which dynamically groups agents with similar abilities into the same subtask. We then condition each subtask policy on its subtask's representation, and agents dealing with the same subtask share their experiences to train that policy. We further introduce two regularizers: one increases the representation difference between subtasks, and the other discourages agents from changing subtasks frequently, which stabilizes training. Empirical results show that LDSA learns reasonable and effective subtask assignment for better collaboration and significantly improves the learning performance on the challenging StarCraft II micromanagement benchmark.
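The abstract does not specify how the components are implemented, so the following is only a minimal NumPy sketch of the data flow it describes, under stated assumptions. The function names, the linear encoder over one-hot identities, the dot-product similarity with a softmax, the greedy (argmax) assignment, and the concrete forms of the two regularizers (mean pairwise distance and a KL term between consecutive selection distributions) are all illustrative choices, not details confirmed by the text.

```python
import numpy as np


def encode_subtasks(num_subtasks, dim, rng):
    """Hypothetical subtask encoder: a random linear map applied to one-hot identities."""
    one_hot = np.eye(num_subtasks)                   # one row per subtask identity
    weights = rng.normal(size=(num_subtasks, dim))   # stand-in for learned parameters
    return one_hot @ weights                         # shape (num_subtasks, dim)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def select_subtasks(abilities, subtask_reprs):
    """Assumed ability-based selection: softmax over ability/subtask dot products.

    Returns the per-agent selection distribution and a greedy (argmax) assignment.
    """
    scores = abilities @ subtask_reprs.T             # shape (n_agents, num_subtasks)
    probs = softmax(scores, axis=1)
    return probs, probs.argmax(axis=1)


def group_agents(assignment, num_subtasks):
    """Group agent indices by subtask; each group would share one subtask policy."""
    return {k: np.flatnonzero(assignment == k).tolist() for k in range(num_subtasks)}


def representation_diversity_loss(subtask_reprs):
    """Assumed regularizer 1: negative mean pairwise distance between subtask vectors.

    Minimizing this value pushes the representations apart. Needs at least 2 subtasks.
    """
    n = subtask_reprs.shape[0]
    diffs = subtask_reprs[:, None, :] - subtask_reprs[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)           # diagonal entries are zero
    return -dists.sum() / (n * (n - 1))


def switching_penalty(prev_probs, probs, eps=1e-8):
    """Assumed regularizer 2: mean KL(prev || current) between selection distributions.

    A small value means agents keep roughly the same subtask between steps.
    """
    kl = (prev_probs * (np.log(prev_probs + eps) - np.log(probs + eps))).sum(axis=1)
    return kl.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reprs = encode_subtasks(num_subtasks=3, dim=4, rng=rng)
    # Hypothetical ability vectors: agents 0 and 1 are identical, agent 2 differs.
    abilities = np.array([[1.0, 0.0, 0.0, 0.0],
                          [1.0, 0.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0, 0.0]])
    probs, assignment = select_subtasks(abilities, reprs)
    print(group_agents(assignment, num_subtasks=3))
    print(representation_diversity_loss(reprs), switching_penalty(probs, probs))
```

In this sketch, agents with identical ability vectors always land in the same group, which mirrors the stated goal of grouping agents with similar abilities. A trained system would presumably learn the encoder and the ability vectors jointly and might sample from the selection distribution rather than taking the argmax.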
