In multi-agent reinforcement learning (MARL), self-interested agents attempt to establish an equilibrium and achieve coordination that depends on the game structure. However, existing MARL approaches are mostly bound to the simultaneous-move setting of the Markov game (MG) framework, and few works consider the formation of equilibrium strategies via asynchronous action coordination. In view of the advantages of the Stackelberg equilibrium (SE) over the Nash equilibrium, we construct a spatio-temporal sequential decision-making structure derived from the MG and propose an N-level policy model based on a conditional hypernetwork shared by all agents. This approach enables asymmetric training with symmetric execution, with each agent responding optimally conditioned on the decisions made by its superior agents. Agents can learn heterogeneous SE policies while still maintaining parameter sharing, which reduces learning and storage costs and improves scalability as the number of agents grows. Experiments demonstrate that our method effectively converges to SE policies in repeated matrix games and performs strongly in far more complex settings, including cooperative and mixed tasks.
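For concreteness, the sketch below illustrates one way an N-level policy with a conditional hypernetwork shared by all agents could be wired up, with inference proceeding level by level so that each agent conditions on the actions of its superiors. It is a minimal sketch under assumed details (discrete actions, PyTorch, one-hot level and action encodings); identifiers such as `HyperActor`, `act_sequentially`, and `superior_actions` are illustrative and not taken from the paper.

```python
# Minimal sketch (not the authors' released code): a hypernetwork shared by all agents
# generates each agent's actor weights from its level index and the actions already
# chosen by superior (higher-level) agents, so policies stay heterogeneous while the
# learned parameters are shared.
import torch
import torch.nn as nn


class HyperActor(nn.Module):
    def __init__(self, obs_dim, act_dim, n_agents, hidden=64):
        super().__init__()
        # Conditioning input: one-hot level index + flattened superior actions.
        cond_dim = n_agents + n_agents * act_dim
        # Shared hypernetwork heads mapping the condition to per-agent actor weights.
        self.w1 = nn.Linear(cond_dim, obs_dim * hidden)
        self.b1 = nn.Linear(cond_dim, hidden)
        self.w2 = nn.Linear(cond_dim, hidden * act_dim)
        self.b2 = nn.Linear(cond_dim, act_dim)
        self.obs_dim, self.act_dim = obs_dim, act_dim
        self.hidden, self.n_agents = hidden, n_agents

    def forward(self, obs, level, superior_actions):
        # obs: (B, obs_dim); level: (B,) long; superior_actions: (B, n_agents, act_dim),
        # with the agent's own slot and all inferior slots zero-masked by the caller.
        cond = torch.cat(
            [nn.functional.one_hot(level, self.n_agents).float(),
             superior_actions.flatten(1)], dim=-1)
        W1 = self.w1(cond).view(-1, self.obs_dim, self.hidden)
        W2 = self.w2(cond).view(-1, self.hidden, self.act_dim)
        h = torch.relu(torch.bmm(obs.unsqueeze(1), W1).squeeze(1) + self.b1(cond))
        logits = torch.bmm(h.unsqueeze(1), W2).squeeze(1) + self.b2(cond)
        return torch.distributions.Categorical(logits=logits)


def act_sequentially(actor, obs_all):
    # Leader-to-follower rollout: agent k best-responds to the actions of agents 0..k-1.
    n_agents = obs_all.shape[0]
    acts = torch.zeros(n_agents, actor.n_agents, actor.act_dim)
    chosen = []
    for k in range(n_agents):
        dist = actor(obs_all[k:k + 1], torch.tensor([k]), acts[k:k + 1].clone())
        a = dist.sample()
        chosen.append(a)
        # Expose agent k's action to every inferior agent (levels > k).
        acts[k + 1:, k] = nn.functional.one_hot(a, actor.act_dim).float()
    return torch.stack(chosen, dim=0)
```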