Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning

Overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning, but has received comparatively little attention in the multi-agent setting. In this work, we empirically demonstrate that QMIX, a popular $Q$-learning algorithm for cooperative multi-agent reinforcement learning (MARL), suffers from a particularly severe overestimation problem which is not mitigated by existing approaches. We rectify this by designing a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline and demonstrate its effectiveness in stabilizing learning. We additionally propose to employ a softmax operator, which we efficiently approximate in the multi-agent setting, to further reduce the potential overestimation bias. We demonstrate that our Softmax with Regularization (SR) method, when applied to QMIX, accomplishes its goal of avoiding severe overestimation and significantly improves performance in a variety of cooperative multi-agent tasks. To demonstrate the versatility of our method, we apply it to other $Q$-learning based MARL algorithms and achieve similar performance gains. Finally, we show that our method provides a consistent performance improvement on a set of challenging StarCraft II micromanagement tasks.
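The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the inverse temperature `beta`, the penalty weight `lam`, and the choice of `baseline` are all assumptions made for illustration, and the softmax is computed by plain enumeration over a small flat vector of action-values rather than by the efficient multi-agent approximation the abstract mentions.

```python
import numpy as np

def softmax_value(q_values, beta=5.0):
    """Softmax-weighted average of action-values (hypothetical helper).

    beta is an assumed inverse temperature: beta = 0 gives the plain mean,
    and a large beta approaches the max operator used in standard Q-learning.
    """
    q = np.asarray(q_values, dtype=float)
    z = beta * (q - q.max())        # shift by the max for numerical stability
    weights = np.exp(z)
    weights /= weights.sum()
    return float(np.dot(weights, q))

def regularized_td_loss(q_tot, target, baseline, lam=0.1):
    """Squared TD error plus a penalty on deviation from a baseline.

    baseline and lam are placeholders; the abstract does not specify them.
    """
    td_error = (q_tot - target) ** 2
    penalty = lam * (q_tot - baseline) ** 2
    return td_error + penalty

# Example: the softmax value lies between the mean and the max of the inputs.
q = [1.0, 2.0, 3.0]
v = softmax_value(q, beta=1.0)
loss = regularized_td_loss(q_tot=2.0, target=2.0, baseline=1.0, lam=0.5)
```

Because the softmax value never exceeds the maximum, using it in place of the max in the bootstrap target can only lower the target, which is the intuition behind using it to reduce overestimation bias.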
