Communication can substantially improve cooperation in multi-agent reinforcement learning (MARL), especially in partially observable tasks. However, existing approaches either broadcast messages, which leads to information redundancy, or learn targeted communication by modeling all other agents as potential targets, which does not scale when the number of agents varies. To address this scalability problem of MARL communication in partially observable tasks, we propose a novel framework, the Transformer-based Email Mechanism (TEM). Agents communicate locally, sending messages only to agents they can observe, without modeling all agents. Inspired by how humans cooperate through email forwarding, we design message chains that forward information so that agents can cooperate with others outside their observation range. We use a Transformer to encode and decode the message chain and selectively choose the next receiver. Empirically, TEM outperforms the baselines on multiple cooperative MARL benchmarks. When the number of agents varies, TEM maintains superior performance without further training.
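The forwarding idea can be illustrated with a minimal sketch. It is not the TEM implementation: the names `Message`, `forward_chain`, `observable`, `score`, and `max_hops` are hypothetical, the stopping rule (hop budget or no unvisited observable agent) is an assumption, and the hand-written `score` function only stands in for the learned Transformer that would choose the next receiver.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Message:
    """Content plus the ordered list of agents the message has visited."""
    content: str
    chain: List[int] = field(default_factory=list)


def forward_chain(
    message: Message,
    sender: int,
    observable: Dict[int, Set[int]],
    score: Callable[[Message, int], float],
    max_hops: int = 4,
) -> List[int]:
    """Forward a message hop by hop and return the resulting chain.

    At every hop the current holder may only pick a receiver among the
    agents it can observe (local communication). `score` is a placeholder
    for a learned receiver-selection model (assumption, not the paper's).
    """
    holder = sender
    message.chain.append(holder)
    for _ in range(max_hops):
        # Candidates: observable agents that have not yet seen the message.
        candidates = [a for a in sorted(observable[holder])
                      if a not in message.chain]
        if not candidates:
            break  # assumed stopping rule: nobody new to forward to
        holder = max(candidates, key=lambda a: score(message, a))
        message.chain.append(holder)
    return message.chain


# Toy line topology: each agent observes only its immediate neighbours.
observable = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
# Hypothetical scoring rule: prefer the agent with the larger id.
chain = forward_chain(Message("enemy spotted"), 0, observable,
                      lambda msg, agent: float(agent))
print(chain)
```

In this toy topology agent 0 cannot observe agent 3, yet the message still reaches it through agents 1 and 2. Because each hop only considers the holder's observable neighbours, the selection step never depends on the total number of agents.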