Current value-based multi-agent reinforcement learning (MARL) methods optimize individual Q values to guide individual agents' behaviours via centralized training with decentralized execution (CTDE). However, such expected, i.e., risk-neutral, Q values are insufficient even under CTDE due to the randomness of rewards and the uncertainty of environments, which causes these methods to fail at training coordinated agents in complex environments. To address these issues, we propose RMIX, a novel cooperative MARL method that applies the Conditional Value at Risk (CVaR) measure over the learned distributions of individual agents' Q values. Specifically, we first learn the return distributions of individual agents so that CVaR can be calculated analytically for decentralized execution. Then, to handle the temporal nature of stochastic outcomes during execution, we propose a dynamic risk level predictor for tuning the risk level. Finally, we optimize the CVaR policies: the CVaR values are used to estimate the target in the TD error during centralized training, and they also serve as auxiliary local rewards to update the local return distributions via the Quantile Regression loss. Empirically, we show that our method significantly outperforms state-of-the-art methods on challenging StarCraft II tasks, demonstrating enhanced coordination and improved sample efficiency.
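As a concrete illustration of how CVaR can be calculated analytically from a learned return distribution, the following minimal sketch assumes the per-agent distribution is represented by N equally weighted quantile estimates (a QR-DQN-style representation); the function name and this representation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cvar_from_quantiles(quantile_values: np.ndarray, alpha: float) -> float:
    """Approximate CVaR_alpha = (1/alpha) * int_0^alpha F^{-1}(u) du for a return
    distribution given by N equally weighted quantile estimates.
    Assumes 0 < alpha <= 1; the quantile representation is an illustrative choice."""
    theta = np.sort(quantile_values)       # quantile estimates, ascending
    n = theta.shape[0]
    mass = np.full(n, 1.0 / n)             # each estimate carries 1/N probability
    lower_cum = np.cumsum(mass) - mass     # probability mass strictly below each estimate
    # Portion of each estimate's mass that falls inside the lower alpha-tail.
    tail_mass = np.clip(alpha - lower_cum, 0.0, mass)
    return float(np.dot(tail_mass, theta) / alpha)
```

At execution time, a risk-sensitive agent would evaluate this quantity for each action's learned return distribution at its current risk level and act greedily with respect to the resulting CVaR values.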
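The last training step updates the local return distributions via the Quantile Regression loss. The sketch below shows the standard quantile-regression Huber loss from distributional RL (QR-DQN); how RMIX forms its targets from CVaR values and auxiliary local rewards is described in the paper itself, so the `target` tensor here is a generic, illustrative stand-in.

```python
import torch
import torch.nn.functional as F

def quantile_huber_loss(pred_quantiles: torch.Tensor,
                        target: torch.Tensor,
                        kappa: float = 1.0) -> torch.Tensor:
    """Standard QR-DQN quantile-regression Huber loss.

    pred_quantiles: (batch, N) predicted quantile values of the return distribution.
    target:         (batch, M) target values (treated as fixed, e.g. detached Bellman targets).
    """
    n = pred_quantiles.shape[1]
    # Midpoint quantile fractions tau_hat_i = (2i + 1) / (2N).
    tau_hat = (torch.arange(n, device=pred_quantiles.device,
                            dtype=pred_quantiles.dtype) + 0.5) / n
    # Pairwise errors u_{ij} = target_j - pred_i, shape (batch, N, M).
    u = target.unsqueeze(1) - pred_quantiles.unsqueeze(-1)
    huber = F.smooth_l1_loss(pred_quantiles.unsqueeze(-1).expand_as(u),
                             target.unsqueeze(1).expand_as(u),
                             reduction="none", beta=kappa)
    # Asymmetric weighting |tau_hat - 1{u < 0}| turns the Huber loss into a quantile loss.
    weight = torch.abs(tau_hat.view(1, n, 1) - (u.detach() < 0).float())
    return (weight * huber / kappa).sum(dim=1).mean()
```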