With the development of sensing and communication technologies in networked cyber-physical systems (CPSs), multi-agent reinforcement learning (MARL)-based methodologies have been integrated into the control processes of physical systems and demonstrate prominent performance in a wide array of CPS domains, such as connected autonomous vehicles (CAVs). However, it remains challenging to mathematically characterize the performance improvement that communication and cooperation capabilities bring to CAVs. Since each individual autonomous vehicle is originally self-interested, we cannot assume that all agents would cooperate naturally during the training process. In this work, we propose to reallocate the system's total reward efficiently to motivate stable cooperation among autonomous vehicles. We formally define and quantify how to reallocate the system's total reward to each agent under the proposed transferable utility game, such that communication-based cooperation among the agents increases the system's total reward. We prove that the Shapley value-based reward reallocation of MARL lies in the core if the transferable utility game is a convex game; hence, the cooperation is stable and efficient, and the agents should stay in the coalition, i.e., the cooperating group. We then propose a cooperative policy learning algorithm with Shapley value reward reallocation. In experiments, we show that our proposed algorithm improves the mean episode system reward of CAV systems compared with several algorithms from the literature.
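For context, a minimal sketch of the quantities involved, assuming the paper's transferable utility (TU) game follows the conventional formulation $(N, v)$ with agent set $N$ and characteristic function $v$: the standard Shapley value assigns each agent $i$ its average marginal contribution over all coalitions,
\[
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}\,
  \bigl( v(S \cup \{i\}) - v(S) \bigr),
\]
and the convexity condition under which the Shapley value is known to lie in the core (Shapley, 1971) is
\[
v(S \cup T) + v(S \cap T) \;\ge\; v(S) + v(T)
\quad \text{for all } S, T \subseteq N.
\]
Intuitively, convexity means an agent's marginal contribution grows as the coalition it joins grows, so no subset of agents can gain by leaving the grand coalition.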