Large sequence models (SMs) such as the GPT series and BERT have displayed outstanding performance and generalization capabilities on vision, language, and, more recently, reinforcement learning tasks. A natural follow-up question is how to abstract multi-agent decision making into an SM problem and benefit from the rapid development of SMs. In this paper, we introduce a novel architecture named Multi-Agent Transformer (MAT) that casts cooperative multi-agent reinforcement learning (MARL) as an SM problem, wherein the task is to map the agents' observation sequence to the agents' optimal action sequence. Our goal is to build a bridge between MARL and SMs so that the modeling power of modern sequence models can be unleashed for MARL. Central to MAT is an encoder-decoder architecture that leverages the multi-agent advantage decomposition theorem to transform the joint policy search problem into a sequential decision-making process; this renders only linear time complexity for multi-agent problems and, most importantly, endows MAT with a monotonic performance improvement guarantee. Unlike prior art such as Decision Transformer, which fits only pre-collected offline data, MAT is trained through online trial and error in the environment in an on-policy fashion. To validate MAT, we conduct extensive experiments on the StarCraft II, Multi-Agent MuJoCo, Dexterous Hands Manipulation, and Google Research Football benchmarks. Results demonstrate that MAT achieves superior performance and data efficiency compared to strong baselines including MAPPO and HAPPO. Furthermore, we demonstrate that MAT is an excellent few-shot learner on unseen tasks regardless of changes in the number of agents. See our project page at https://sites.google.com/view/multi-agent-transformer.
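To make the sequential view concrete, here is a minimal toy sketch in Python. It is not MAT's implementation and uses no neural networks. It assumes a hypothetical lookup table of conditional advantages, one value per agent action given the actions already chosen by the preceding agents, and defines the joint advantage as the sum of those terms, in the spirit of the advantage decomposition. Each row is centred under a uniform policy (an assumption of this toy), so the best action in every row has a non-negative advantage. All function and variable names are illustrative.

```python
import itertools
import random


def build_conditional_advantages(n_agents, n_actions, seed=0):
    """Hypothetical toy table: (prefix of earlier actions, action) -> advantage.

    Each row is centred so its mean is zero, mimicking an advantage that
    averages to zero under a uniform policy (an assumption of this sketch).
    """
    rng = random.Random(seed)
    table = {}
    for m in range(n_agents):
        for prefix in itertools.product(range(n_actions), repeat=m):
            row = [rng.uniform(-1.0, 1.0) for _ in range(n_actions)]
            mean = sum(row) / n_actions
            for action, value in enumerate(row):
                table[(prefix, action)] = value - mean
    return table


def joint_advantage(table, joint_action):
    """Joint advantage as the sum of per-agent conditional advantages."""
    return sum(
        table[(tuple(joint_action[:m]), action)]
        for m, action in enumerate(joint_action)
    )


def sequential_decode(table, n_agents, n_actions):
    """Choose actions one agent at a time, conditioning on earlier choices."""
    prefix, evaluations = (), 0
    for _ in range(n_agents):
        scores = [table[(prefix, action)] for action in range(n_actions)]
        evaluations += n_actions  # one table lookup per candidate action
        best = max(range(n_actions), key=scores.__getitem__)
        prefix = prefix + (best,)
    return prefix, evaluations


N_AGENTS, N_ACTIONS = 3, 4
table = build_conditional_advantages(N_AGENTS, N_ACTIONS)
decoded, evaluations = sequential_decode(table, N_AGENTS, N_ACTIONS)
n_joint_actions = N_ACTIONS ** N_AGENTS  # size of the joint action space

print("decoded joint action:", decoded)
print("candidates scored sequentially:", evaluations)
print("joint actions an exhaustive search would score:", n_joint_actions)
print("joint advantage of decoded action:", joint_advantage(table, decoded))
```

In this toy, the sequential pass scores `N_AGENTS * N_ACTIONS` candidates, whereas an exhaustive search over joint actions would score `N_ACTIONS ** N_AGENTS`. Greedy sequential selection is not guaranteed to find the best joint action; the sketch only shows that per-agent choices with non-negative conditional advantage add up to a non-negative joint advantage.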