Multi-agent reinforcement learning (MARL) has witnessed significant progress with the development of value function factorization methods, in which monotonicity of the factorization allows a joint action-value function to be optimized by maximizing factorized per-agent utilities. In this paper, we show that in partially observable MARL problems, an agent's ordering over its own actions can impose concurrent constraints (across different states) on the representable function class, causing significant estimation error during training. We tackle this limitation and propose PAC, a new framework leveraging Assistive information generated from Counterfactual Predictions of optimal joint action selection, which enables explicit assistance to value function factorization through a novel counterfactual loss. A variational inference-based information encoding method is developed to collect and encode the counterfactual predictions from an estimated baseline. To enable decentralized execution, we also derive factorized per-agent policies inspired by a maximum-entropy MARL framework. We evaluate PAC on multi-agent predator-prey and a set of StarCraft II micromanagement tasks. Empirical results show that PAC outperforms state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms on all benchmarks.
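For orientation, the monotonicity-based factorization referenced above can be stated as follows; this is the standard Individual-Global-Max (IGM) condition and the QMIX-style monotonic sufficient condition, given here as background rather than as part of PAC itself:
\[
\arg\max_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u})
= \Big( \arg\max_{u_1} Q_1(\tau_1, u_1), \ldots, \arg\max_{u_n} Q_n(\tau_n, u_n) \Big),
\qquad
\frac{\partial Q_{tot}(\boldsymbol{\tau}, \mathbf{u})}{\partial Q_i(\tau_i, u_i)} \ge 0 \;\; \forall i,
\]
where $\boldsymbol{\tau}$ and $\mathbf{u}$ denote the joint action-observation history and joint action, and $Q_i$ is agent $i$'s factorized utility. The restriction to monotonic mixing is what limits the representable function class and motivates the counterfactual assistance introduced in this paper.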