Offline reinforcement learning leverages previously collected offline datasets to learn optimal policies without requiring access to the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the increased interactions among agents and with the environment. Yet, in MARL, the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research available. In this paper, we facilitate the research by providing large-scale datasets and use them to examine the usage of the Decision Transformer in the context of MARL. We investigate the generalisation of MARL offline pre-training in the following three aspects: 1) between single agents and multiple agents, 2) from offline pre-training to online fine-tuning, and 3) across multiple downstream tasks with few-shot and zero-shot capabilities. We start by introducing the first offline MARL dataset with diverse quality levels based on the StarCraft II environment, and then propose a novel Multi-Agent Decision Transformer (MADT) architecture for effective offline learning. MADT leverages the Transformer's sequence-modelling ability and integrates it seamlessly with both offline and online MARL tasks. A crucial benefit of MADT is that it learns generalisable policies that can transfer between different types of agents under different task scenarios. On the StarCraft II offline dataset, MADT outperforms the state-of-the-art offline RL baselines. When applied to online tasks, the pre-trained MADT significantly improves sample efficiency and achieves strong performance in both few-shot and zero-shot cases. To the best of our knowledge, this is the first work that studies and demonstrates the effectiveness of offline pre-trained models in terms of sample efficiency and generalisability enhancements in MARL.
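To make the sequence-modelling view of offline MARL concrete, the following is a minimal, illustrative sketch (not the paper's exact MADT architecture) of a Decision-Transformer-style policy that maps per-agent (return-to-go, observation, action) token sequences to discrete actions and is pre-trained on offline trajectories with a behaviour-cloning loss. All module names, dimensions, and the use of a shared PyTorch Transformer encoder here are assumptions for demonstration only.

```python
# Minimal Decision-Transformer-style policy for offline pre-training (illustrative sketch).
import torch
import torch.nn as nn


class MiniMultiAgentDT(nn.Module):
    def __init__(self, obs_dim, act_dim, embed_dim=64, n_layers=2, n_heads=4, max_len=64):
        super().__init__()
        self.embed_rtg = nn.Linear(1, embed_dim)         # return-to-go token
        self.embed_obs = nn.Linear(obs_dim, embed_dim)   # local observation token
        self.embed_act = nn.Embedding(act_dim, embed_dim)
        self.embed_time = nn.Embedding(max_len, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(embed_dim, act_dim)        # next-action logits

    def forward(self, rtg, obs, act, timesteps):
        # rtg: (B, T, 1), obs: (B, T, obs_dim), act: (B, T) long, timesteps: (B, T) long
        B, T = act.shape
        t_emb = self.embed_time(timesteps)
        # Interleave (return-to-go, observation, action) tokens per timestep.
        tokens = torch.stack(
            [self.embed_rtg(rtg) + t_emb,
             self.embed_obs(obs) + t_emb,
             self.embed_act(act) + t_emb], dim=2
        ).reshape(B, 3 * T, -1)
        # Causal mask so each token only attends to earlier tokens.
        mask = torch.triu(torch.ones(3 * T, 3 * T, dtype=torch.bool), diagonal=1)
        h = self.encoder(tokens, mask=mask)
        # Predict the action for step t from the observation token at step t.
        return self.head(h[:, 1::3])


# Offline pre-training step: behaviour cloning on dataset actions,
# treating each agent's trajectory as one sequence (a deliberate simplification).
model = MiniMultiAgentDT(obs_dim=10, act_dim=5)
B, T = 8, 16
rtg = torch.randn(B, T, 1)
obs = torch.randn(B, T, 10)
act = torch.randint(0, 5, (B, T))
ts = torch.arange(T).repeat(B, 1)
logits = model(rtg, obs, act, ts)
loss = nn.functional.cross_entropy(logits.reshape(-1, 5), act.reshape(-1))
loss.backward()
```

In this simplified sketch, online fine-tuning would reuse the pre-trained weights and continue updating them on freshly collected trajectories; the actual MADT integration with online MARL training is described in the paper itself.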