Cooperative multi-agent reinforcement learning (MARL) requires agents to explore in order to learn how to cooperate. Existing value-based MARL algorithms commonly rely on random exploration, such as $\epsilon$-greedy, which is inefficient at discovering multi-agent cooperation. Additionally, the environment in MARL appears non-stationary to any individual agent due to the simultaneous training of other agents, leading to high-variance and thus unstable optimisation signals. In this work, we propose ensemble value functions for multi-agent exploration (EMAX), a general framework to extend any value-based MARL algorithm. EMAX trains ensembles of value functions for each agent to address the key challenges of exploration and non-stationarity: (1) The uncertainty of value estimates across the ensemble is used in a UCB policy to guide the exploration of agents towards parts of the environment that require cooperation. (2) Average value estimates across the ensemble serve as target values. These targets exhibit lower variance than commonly applied target networks, and we show that they lead to more stable gradients during optimisation. We instantiate three value-based MARL algorithms with EMAX: independent DQN, VDN, and QMIX, and evaluate them on 21 tasks across four environments. Using ensembles of five value functions, EMAX improves sample efficiency and final evaluation returns of these algorithms by 54%, 55%, and 844%, respectively, averaged over all 21 tasks.
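The two mechanisms can be illustrated with a minimal sketch, shown below. This is an illustrative assumption on our part rather than code from the paper: the function names (`ucb_action`, `ensemble_target`) and the exploration coefficient `ucb_beta` are hypothetical placeholders, and the exact form of the UCB bonus and bootstrap may differ from the actual EMAX implementation.

```python
import torch

def ucb_action(ensemble_q: torch.Tensor, ucb_beta: float = 1.0) -> int:
    """Select an action via a UCB rule over an ensemble of value estimates.

    ensemble_q: tensor of shape (ensemble_size, num_actions) holding one
    Q-value estimate per ensemble member and action for the current observation.
    """
    mean_q = ensemble_q.mean(dim=0)   # average estimate per action
    std_q = ensemble_q.std(dim=0)     # ensemble disagreement as an uncertainty proxy
    return int(torch.argmax(mean_q + ucb_beta * std_q).item())

def ensemble_target(next_ensemble_q: torch.Tensor, reward: float,
                    gamma: float = 0.99) -> torch.Tensor:
    """Bootstrapped target built from the ensemble mean rather than a single target network.

    next_ensemble_q: tensor of shape (ensemble_size, num_actions) for the next observation.
    """
    mean_next_q = next_ensemble_q.mean(dim=0)     # average across ensemble members
    return reward + gamma * mean_next_q.max()     # greedy bootstrap on the averaged values

# Example usage with an ensemble of five value estimates over four actions:
q_vals = torch.randn(5, 4)
action = ucb_action(q_vals, ucb_beta=0.5)
target = ensemble_target(torch.randn(5, 4), reward=1.0)
```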