Sparse and delayed rewards pose a challenge to single-agent reinforcement learning. This challenge is amplified in multi-agent reinforcement learning (MARL), where credit assignment for these rewards must happen not only across time but also across agents. We propose Agent-Time Attention (ATA), a neural network model with auxiliary losses for redistributing sparse and delayed rewards in collaborative MARL. We provide a simple example demonstrating that giving agents their own local redistributed rewards versus shared global redistributed rewards motivates different policies. We extend several MiniGrid environments, specifically MultiRoom and DoorKey, to the multi-agent sparse and delayed rewards setting. We demonstrate that ATA outperforms various baselines on many instances of these environments. Source code for the experiments is available at https://github.com/jshe/agent-time-attention.