In multi-agent reinforcement learning (MARL), many popular methods, such as VDN and QMIX, are susceptible to a critical multi-agent pathology known as relative overgeneralization (RO), which arises when the optimal joint action's utility falls below that of a sub-optimal joint action in cooperative tasks. RO can cause the agents to get stuck in local optima or fail to solve tasks that require significant coordination between agents within a given timestep. Recent value-based MARL algorithms such as QPLEX and WQMIX can overcome RO to some extent. However, our experimental results show that they can still fail to solve cooperative tasks that exhibit strong RO. In this work, we propose a novel approach called curriculum learning for relative overgeneralization (CURO) to better overcome RO. To solve a target task that exhibits strong RO, CURO first fine-tunes the reward function of the target task to generate source tasks that are tailored to the current ability of the learning agent, and trains the agent on these source tasks before the target task. Then, to effectively transfer the knowledge acquired in one task to the next, CURO uses a novel transfer learning method that combines value function transfer with buffer transfer, which enables more efficient exploration in the target task. We demonstrate that, when applied to QMIX, CURO overcomes severe RO and significantly improves performance, yielding state-of-the-art results on a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.
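The curriculum described above can be read as a simple loop over reward-fine-tuned source tasks followed by the target task, with the learner's weights and replay buffer carried from one task to the next. The sketch below is only an illustration of that reading, not the authors' implementation: the names (curo_curriculum, make_task, train_on_task, the learner object) are hypothetical, and the reward fine-tuning is reduced to a single scale factor purely for concreteness.

```python
from collections import deque
from typing import Callable, Deque, List

def curo_curriculum(
    learner,                                          # hypothetical QMIX-style learner; its weights persist across tasks
    make_task: Callable[[float], object],             # hypothetical: builds a task with a fine-tuned (here, scaled) reward
    train_on_task: Callable[[object, object, Deque], None],  # hypothetical single-task training loop
    source_reward_scales: List[float],                # curriculum of source tasks, easiest first
    buffer_capacity: int = 5000,
):
    """Train on reward-fine-tuned source tasks before the target task,
    carrying over both the learner's weights (value function transfer)
    and the replay buffer (buffer transfer)."""
    buffer: Deque = deque(maxlen=buffer_capacity)     # shared replay buffer, reused across tasks
    for scale in source_reward_scales + [1.0]:        # 1.0 = the original (target) reward function
        task = make_task(scale)
        # The learner is not reinitialised and the buffer keeps transitions
        # from earlier, easier tasks, so exploration on the harder task starts
        # from joint behaviours that already earned reward.
        train_on_task(learner, task, buffer)
    return learner
```

In practice the source-task rewards would be generated by the fine-tuning procedure the paper proposes rather than a plain scale factor; the sketch only shows where the value function transfer and buffer transfer enter the loop.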