Decentralized cooperative multi-agent deep reinforcement learning (MARL) can be a versatile learning framework, particularly in scenarios where centralized training is not possible or not practical. One of the key challenges in decentralized deep MARL is the non-stationarity of the learning environment when multiple agents are learning concurrently. A commonly used and efficient scheme for decentralized MARL is independent learning, in which agents update their policies concurrently and independently of each other. We first show that independent learning does not always converge, while sequential learning, where agents update their policies one after another in a sequence, is guaranteed to converge to an agent-by-agent optimal solution. In sequential learning, when one agent updates its policy, all other agents' policies are kept fixed, which alleviates the non-stationarity caused by concurrent updates to other agents' policies. However, sequential learning is slow because only one agent learns at any time, so it may not always be practical either. In this work, we propose a decentralized cooperative MARL algorithm based on multi-timescale learning, in which all agents learn concurrently but at different learning rates. In our proposed method, when one agent updates its policy, the other agents are allowed to update their policies as well, but at a slower rate. This speeds up sequential learning while also mitigating the non-stationarity caused by concurrent updates from other agents. Multi-timescale learning outperforms state-of-the-art decentralized learning methods on a set of challenging multi-agent cooperative tasks in the EPyMARL benchmark (Papoudakis et al., 2020). This can be seen as a first step towards more general decentralized cooperative deep MARL methods based on multi-timescale learning.
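To make the multi-timescale update schedule concrete, the following is a minimal sketch in Python using independent tabular Q-learners rather than deep policies. The class name, the fast_lr, slow_lr, and rotate_every parameters, and the round-robin rotation of the "fast" role are illustrative assumptions for exposition only, not the paper's exact algorithm.

```python
import numpy as np

class MultiTimescaleLearners:
    """Illustrative sketch: all agents update every step, but only one agent
    (the current 'fast' agent) uses a large learning rate; the rest update
    slowly, approximating sequential learning without freezing anyone."""

    def __init__(self, n_agents, n_states, n_actions,
                 fast_lr=0.5, slow_lr=0.05, gamma=0.99, rotate_every=1000):
        self.n_agents = n_agents
        self.fast_lr = fast_lr        # learning rate of the currently fast agent
        self.slow_lr = slow_lr        # learning rate of all other agents
        self.gamma = gamma
        self.rotate_every = rotate_every
        self.step = 0
        self.fast_agent = 0           # index of the agent currently learning fast
        self.q = [np.zeros((n_states, n_actions)) for _ in range(n_agents)]

    def update(self, states, actions, rewards, next_states):
        """One concurrent Q-learning update for all agents."""
        for i in range(self.n_agents):
            lr = self.fast_lr if i == self.fast_agent else self.slow_lr
            q = self.q[i]
            td_target = rewards[i] + self.gamma * q[next_states[i]].max()
            td_error = td_target - q[states[i], actions[i]]
            q[states[i], actions[i]] += lr * td_error
        self.step += 1
        if self.step % self.rotate_every == 0:
            # Rotate the fast role so every agent periodically gets to learn
            # quickly while the others change only slowly.
            self.fast_agent = (self.fast_agent + 1) % self.n_agents
```

In this sketch, the rotation plays the role of the sequence in sequential learning, while the small but nonzero slow_lr is what distinguishes multi-timescale learning from freezing the other agents entirely.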