We consider the problem of learning to communicate using multi-agent reinforcement learning (MARL). A common approach is to learn off-policy, using data sampled from a replay buffer. However, messages received in the past may no longer reflect the current communication policies of the other agents, which complicates learning. We therefore introduce a 'communication correction' that accounts for the non-stationarity of observed communication induced by multi-agent learning. It works by relabelling each received message so that it is likely under the communicator's current policy, and therefore a better reflection of the receiver's current environment. To handle cases in which agents are both senders and receivers, we introduce an ordered relabelling scheme. Our correction is computationally efficient and can be integrated with a range of off-policy algorithms. In our experiments, it substantially improves the ability of communicating MARL systems to learn across a variety of cooperative and competitive tasks.
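To make the relabelling step concrete, the following is a minimal sketch of the communication correction, assuming a replay buffer that stores, per transition, each sender's observation alongside the message the receiver actually got. All names here (Transition, sender_policies, relabel_batch, sender_order) are illustrative, not taken from the paper's code.

```python
# A hedged sketch of message relabelling for off-policy MARL with communication.
# Assumption: messages are real-valued vectors and sender policies are
# deterministic functions of (own observation, inbound messages).
from dataclasses import dataclass, replace
from typing import Callable, Dict, List

Message = List[float]

@dataclass(frozen=True)
class Transition:
    obs: Dict[str, List[float]]        # per-agent observations at time t
    messages: Dict[str, Message]       # messages received at time t, keyed by sender
    action: Dict[str, int]
    reward: float
    next_obs: Dict[str, List[float]]

def relabel_batch(
    batch: List[Transition],
    sender_policies: Dict[str, Callable[[List[float], Dict[str, Message]], Message]],
    sender_order: List[str],
) -> List[Transition]:
    """Replace stale stored messages with what each sender would send *now*.

    Agents that are both senders and receivers are processed in `sender_order`,
    so a sender's recomputed message can condition on already-relabelled
    inbound messages. This sketches the ordered relabelling idea as a single
    in-timestep pass; the paper's scheme may differ in detail.
    """
    corrected = []
    for t in batch:
        new_messages = dict(t.messages)
        for sender in sender_order:
            # Recompute the message under the sender's *current* policy, given
            # its stored observation and its (already corrected) inbound messages.
            inbound = {s: m for s, m in new_messages.items() if s != sender}
            new_messages[sender] = sender_policies[sender](t.obs[sender], inbound)
        corrected.append(replace(t, messages=new_messages))
    return corrected
```

A corrected batch can then be passed to any off-policy learner (e.g. a DQN- or MADDPG-style update) in place of the raw sample, which is why the correction composes with a range of algorithms without changing their update rules.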