Shapley Counterfactual Credits for Multi-Agent Reinforcement Learning

Centralized Training with Decentralized Execution (CTDE) is a popular paradigm in cooperative Multi-Agent Reinforcement Learning (MARL) and is widely used in real applications. One of the major challenges in training is credit assignment, which aims to deduce the contribution of each agent from the global reward. Existing credit assignment methods focus on either decomposing the joint value function into individual value functions or measuring the impact of local observations and actions on the global value function. These approaches do not thoroughly account for the complicated interactions among multiple agents, which leads to unsuitable credit assignment and, in turn, mediocre MARL results. We propose Shapley Counterfactual Credit Assignment, a novel method for explicit credit assignment that accounts for coalitions of agents. Specifically, the Shapley value and its desirable properties are leveraged in deep MARL to credit any combination of agents, which lets us estimate the individual credit of each agent. The main technical difficulty is the computational complexity of the Shapley value, which grows factorially with the number of agents. We therefore approximate it via Monte Carlo sampling, which reduces the sample complexity while maintaining effectiveness. We evaluate our method on StarCraft II benchmarks across different scenarios. Our method significantly outperforms existing cooperative MARL algorithms and achieves state-of-the-art performance, with especially large margins on the more difficult tasks.

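The Monte Carlo approximation mentioned in the abstract can be illustrated with a minimal sketch. It uses the standard permutation-sampling estimator of the Shapley value: draw random orderings of the agents and average each agent's marginal contribution when it joins the agents that precede it. The names `monte_carlo_shapley` and `toy_value`, and the toy coalition value itself, are assumptions made for illustration only. In the method described above the coalition value would come from a learned centralized critic, which is not reproduced here.

```python
import random
from typing import Callable, FrozenSet, List


def toy_value(coalition: FrozenSet[int]) -> float:
    """Hypothetical coalition value: additive skills plus a synergy bonus.

    Agents 0, 1, 2 contribute 1, 2, 3 on their own; agents 0 and 1
    earn an extra 4 when they are both present.
    """
    weights = [1.0, 2.0, 3.0]
    value = sum(weights[a] for a in coalition)
    if 0 in coalition and 1 in coalition:
        value += 4.0
    return value


def monte_carlo_shapley(
    n_agents: int,
    value_fn: Callable[[FrozenSet[int]], float],
    n_samples: int = 2000,
    seed: int = 0,
) -> List[float]:
    """Estimate Shapley values by sampling random agent orderings."""
    rng = random.Random(seed)
    order = list(range(n_agents))
    credits = [0.0] * n_agents
    for _ in range(n_samples):
        rng.shuffle(order)
        coalition: set = set()
        prev = value_fn(frozenset(coalition))
        for agent in order:
            # Marginal contribution of `agent` to the agents before it.
            coalition.add(agent)
            cur = value_fn(frozenset(coalition))
            credits[agent] += cur - prev
            prev = cur
    return [c / n_samples for c in credits]


estimates = monte_carlo_shapley(3, toy_value)
```

For this toy game the exact Shapley values are 3, 4, and 3: the synergy bonus of 4 is split evenly between agents 0 and 1. Each sampled ordering costs `n_agents` evaluations of the value function, so the total cost is `n_samples * n_agents` evaluations rather than a sum over all `n_agents!` orderings. Because the marginal contributions within one ordering telescope, the estimates always sum to the value of the full coalition minus the value of the empty one.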