We study multi-agent reinforcement learning (MARL) with centralized training and decentralized execution. During training, new agents may join and existing agents may unexpectedly leave. In such situations, a standard deep MARL model must be retrained from scratch, which is very time-consuming. To tackle this problem, we propose a special network architecture together with a few-shot learning algorithm that allows the number of agents to vary during centralized training. In particular, when a new agent joins the centralized training, our few-shot learning algorithm trains its policy network and value network using a small number of samples; when an agent leaves, the training of the remaining agents is unaffected. Our experiments show that with the proposed network architecture and algorithm, model adaptation when new agents join can be more than 100 times faster than the baseline. Our work is applicable to any setting, including cooperative, competitive, and mixed ones.
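To make the idea concrete, the following is a minimal sketch, not the paper's actual architecture, of one way a MARL model can be organized so that agents may be added or removed during centralized training without disturbing the others: a shared observation encoder with independently owned per-agent policy and value heads, and a few-shot fine-tuning loop that updates only the new agent's heads. The class and function names (ElasticMARLModel, few_shot_adapt), the shared-encoder design, and the loss used for adaptation are all illustrative assumptions.

```python
# Illustrative sketch only -- not the architecture or algorithm from the abstract.
# Assumes PyTorch; per-agent heads over a shared encoder are an assumption.
import torch
import torch.nn as nn


class ElasticMARLModel(nn.Module):
    """Shared encoder with per-agent policy and value heads, so heads can be
    added or dropped without touching the other agents' parameters."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.act_dim, self.hidden = act_dim, hidden
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_heads = nn.ModuleDict()  # agent_id -> policy head
        self.value_heads = nn.ModuleDict()   # agent_id -> value head

    def add_agent(self, agent_id: str):
        """A joining agent gets fresh heads; existing agents are untouched."""
        self.policy_heads[agent_id] = nn.Linear(self.hidden, self.act_dim)
        self.value_heads[agent_id] = nn.Linear(self.hidden, 1)

    def remove_agent(self, agent_id: str):
        """A leaving agent only discards its own heads."""
        del self.policy_heads[agent_id]
        del self.value_heads[agent_id]

    def forward(self, agent_id: str, obs: torch.Tensor):
        h = self.encoder(obs)
        return self.policy_heads[agent_id](h), self.value_heads[agent_id](h)


def few_shot_adapt(model, agent_id, obs, acts, returns, steps=50):
    """Fit only the new agent's heads on a small batch of samples, keeping
    the shared encoder and the other agents' heads frozen."""
    params = list(model.policy_heads[agent_id].parameters()) + \
             list(model.value_heads[agent_id].parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(steps):
        logits, value = model(agent_id, obs)
        log_probs = torch.log_softmax(logits, dim=-1)
        chosen = log_probs.gather(1, acts.unsqueeze(1)).squeeze(1)
        adv = returns - value.squeeze(1).detach()
        loss = -(chosen * adv).mean() + (returns - value.squeeze(1)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Under these assumptions, `add_agent` followed by `few_shot_adapt` on a small batch of the new agent's experience adapts the model without retraining, and `remove_agent` leaves the remaining agents' parameters and training untouched.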