Cooperative multi-agent tasks require agents to deduce their own contributions from a shared global reward, a problem known as the credit assignment challenge. Common policy-based multi-agent reinforcement learning methods address this challenge by introducing differentiated value functions or advantage functions for individual agents. In a multi-agent system, the policies of different agents need to be evaluated jointly. In order to update the policies synchronously, such value functions or advantage functions also require synchronous evaluation. However, in current methods, value functions or advantage functions rely on counter-factual joint actions that are evaluated asynchronously, and therefore suffer from an inherent estimation bias. In this work, we propose approximatively synchronous advantage estimation. We first derive the marginal advantage function, an extension of the single-agent advantage function to multi-agent systems. Furthermore, we introduce a policy approximation for synchronous advantage estimation, and decompose the multi-agent policy optimization problem into multiple sub-problems of single-agent policy optimization. Our method is compared with baseline algorithms on StarCraft multi-agent challenges, and shows the best performance on most of the tasks.
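As a rough illustration of the marginalization idea (the notation below is assumed for exposition and may differ from the exact definitions in the body of the paper), the marginal advantage of agent $a$ can be written by averaging the joint action-value over the other agents' policies:

\[
A^{a}(s, u^{a}) \;=\; \mathbb{E}_{u^{-a} \sim \pi^{-a}(\cdot \mid s)}\!\left[ Q\big(s, (u^{a}, u^{-a})\big) \right] \;-\; \mathbb{E}_{\boldsymbol{u} \sim \boldsymbol{\pi}(\cdot \mid s)}\!\left[ Q(s, \boldsymbol{u}) \right],
\]

where $u^{-a}$ denotes the joint action of all agents other than $a$ and $\pi^{-a}$ their joint policy. Evaluating this expectation synchronously would require the other agents' updated policies, which are not available during a simultaneous update; the policy approximation mentioned above is what makes an approximately synchronous estimate tractable.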