UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers

Recent advances in multi-agent reinforcement learning have largely been limited to training one model from scratch for every new task. This limitation stems from restricted model architectures with fixed input and output dimensions, and it hinders the accumulation and transfer of a learned agent's experience across tasks of varying difficulty (e.g., 3 vs. 3 or 5 vs. 6 multi-agent games). In this paper, we make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing a single architecture that fits tasks with different observation and action configurations. Unlike previous RNN-based models, we use a transformer-based model to generate a flexible policy by decoupling the policy distribution from the intertwined input observation, with importance weights determined by the self-attention mechanism. Compared to a standard transformer block, the proposed model, named Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the decision process of multi-agent tasks more explainable. UPDeT is general enough to be plugged into any multi-agent reinforcement learning pipeline and equips it with strong generalization abilities that enable the handling of multiple tasks at a time. Extensive experiments on large-scale SMAC multi-agent competitive games demonstrate that the proposed UPDeT-based multi-agent reinforcement learning achieves significant improvements relative to state-of-the-art approaches, demonstrating advantageous transfer capability in terms of both performance and training speed (10 times faster).
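To make the idea of policy decoupling more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the observation is split into one feature vector per entity (the agent itself, its allies, its enemies), that a single self-attention layer mixes these tokens, that entity-independent "basic" actions are read from the agent's own output token, and that one "attack" value is read from each enemy's output token. The class name `DecoupledPolicySketch`, the layer sizes, the random weights, and the choice of six basic actions are all hypothetical and chosen only for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class DecoupledPolicySketch:
    """Hypothetical single-head attention policy with per-entity outputs."""

    def __init__(self, feat_dim, hidden_dim, n_basic_actions, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.W_embed = random_matrix(hidden_dim, feat_dim, rng)
        self.W_q = random_matrix(hidden_dim, hidden_dim, rng)
        self.W_k = random_matrix(hidden_dim, hidden_dim, rng)
        self.W_v = random_matrix(hidden_dim, hidden_dim, rng)
        # Shared heads: none of these shapes depend on the number of entities.
        self.W_basic = random_matrix(n_basic_actions, hidden_dim, rng)
        self.W_attack = random_matrix(1, hidden_dim, rng)

    def attend(self, tokens):
        """Scaled dot-product self-attention over a variable-length token list."""
        embedded = [matvec(self.W_embed, t) for t in tokens]
        queries = [matvec(self.W_q, e) for e in embedded]
        keys = [matvec(self.W_k, e) for e in embedded]
        values = [matvec(self.W_v, e) for e in embedded]
        scale = math.sqrt(self.hidden_dim)
        outputs, weights = [], []
        for q in queries:
            scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
            w = softmax(scores)
            weights.append(w)
            outputs.append([sum(w[j] * values[j][d] for j in range(len(values)))
                            for d in range(self.hidden_dim)])
        return outputs, weights

    def q_values(self, own, allies, enemies):
        """Return action values and the own token's attention weights."""
        tokens = [own] + allies + enemies
        outputs, weights = self.attend(tokens)
        # Basic actions (e.g. move, stop) come from the agent's own token.
        basic = matvec(self.W_basic, outputs[0])
        # One attack value per enemy, read from that enemy's output token.
        first_enemy = 1 + len(allies)
        attack = [matvec(self.W_attack, out)[0] for out in outputs[first_enemy:]]
        return basic + attack, weights[0]


def random_entities(n, feat_dim, rng):
    """Generate n placeholder entity feature vectors."""
    return [[rng.random() for _ in range(feat_dim)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(1)
    model = DecoupledPolicySketch(feat_dim=4, hidden_dim=8, n_basic_actions=6)
    # The same weights serve a 3 vs. 3 and a 5 vs. 6 scenario.
    for n_allies, n_enemies in [(2, 3), (4, 6)]:
        own = random_entities(1, 4, rng)[0]
        q, attn = model.q_values(own,
                                 random_entities(n_allies, 4, rng),
                                 random_entities(n_enemies, 4, rng))
        print(len(q), "action values;", len(attn), "attention weights")
```

Because every weight matrix is applied per token, the number of action values grows with the number of enemies while the parameter count stays fixed, which is the property that lets one model be reused across tasks of different sizes. The attention weights of the agent's own token give a rough indication of which entities influenced the decision, in the spirit of the explainability claim above.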
