Decentralized learning has shown great promise for cooperative multi-agent reinforcement learning (MARL). However, non-stationarity remains a significant challenge in decentralized learning. In this paper, we tackle the non-stationarity problem in the simplest and most fundamental way and propose \textit{multi-agent alternate Q-learning} (MA2QL), where agents take turns updating their Q-functions by Q-learning. MA2QL is a \textit{minimalist} approach to fully decentralized cooperative MARL but is theoretically grounded. We prove that when each agent guarantees an $\varepsilon$-convergence at each turn, their joint policy converges to a Nash equilibrium. In practice, MA2QL only requires minimal changes to independent Q-learning (IQL). We empirically evaluate MA2QL on a variety of cooperative multi-agent tasks. Results show that MA2QL consistently outperforms IQL, which verifies the effectiveness of MA2QL despite such minimal changes.
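The following is a minimal sketch of the alternating update scheme described above, not the paper's implementation: at each turn, only one agent updates its tabular Q-function by standard Q-learning while the other agents keep their Q-functions (and hence their policies) fixed. The environment interface (\texttt{env.reset}, \texttt{env.step}, \texttt{env.n\_actions}) and all hyperparameter names are hypothetical and only for illustration.

\begin{verbatim}
# Minimal sketch of alternate Q-learning (MA2QL-style turn taking).
# The multi-agent environment interface below is assumed, not given in the paper.
import random
from collections import defaultdict
import numpy as np

def alternate_q_learning(env, n_agents=2, n_turns=10, episodes_per_turn=100,
                         alpha=0.1, gamma=0.99, epsilon=0.1):
    # One tabular Q-function per agent: Q[i][obs] -> vector of action values.
    Q = [defaultdict(lambda: np.zeros(env.n_actions)) for _ in range(n_agents)]

    def act(i, obs, explore):
        # Epsilon-greedy for the learning agent, greedy for the others.
        if explore and random.random() < epsilon:
            return random.randrange(env.n_actions)
        return int(np.argmax(Q[i][obs]))

    for turn in range(n_turns):
        learner = turn % n_agents  # the agent whose turn it is to learn
        for _ in range(episodes_per_turn):
            obs = env.reset()      # list of per-agent observations
            done = False
            while not done:
                actions = [act(i, obs[i], explore=(i == learner))
                           for i in range(n_agents)]
                next_obs, reward, done = env.step(actions)  # shared team reward
                # Standard Q-learning update, applied to the learner only;
                # all other agents' Q-functions stay fixed for the whole turn.
                o, a, o2 = obs[learner], actions[learner], next_obs[learner]
                target = reward + (0.0 if done else gamma * np.max(Q[learner][o2]))
                Q[learner][o][a] += alpha * (target - Q[learner][o][a])
                obs = next_obs
    return Q
\end{verbatim}

Setting \texttt{n\_agents} of this sketch to update all agents every step would recover IQL; the only change MA2QL makes is restricting updates to one agent per turn, which is what keeps each agent's learning problem stationary within a turn.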