When learning a task as a team, some agents in Multi-Agent Reinforcement Learning (MARL) may fail to understand their true impact on the team's performance. Such agents end up learning sub-optimal policies and demonstrating undesired lazy behaviours. To investigate this problem, we start by formalising the application of temporal causality to MARL problems. We then show how causality can be used to penalise such lazy agents and improve their behaviour. By understanding how its local observations are causally related to the team reward, each agent in the team can adjust its individual credit based on whether or not it helped to cause the reward. We show empirically that using causality estimates in MARL improves not only the holistic performance of the team, but also the individual capabilities of each agent, and we observe that these improvements are consistent across a set of different environments.
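The abstract does not specify how temporal causality is estimated, so the minimal sketch below assumes a Granger-style test as an illustrative stand-in: an agent's lagged local observations are taken to "cause" the team reward if they reduce the error of predicting it beyond what the reward's own history achieves, and the resulting score rescales each agent's share of the team reward, penalising agents with no detectable causal link. All names here (`granger_score`, `causal_credit`) are hypothetical and not from the paper.

```python
# Sketch only: Granger-style temporal causality as a stand-in for the paper's
# (unspecified) causality estimator, used to reshape per-agent credit.
import numpy as np

def granger_score(obs_history, rewards, lag=1):
    """Variance reduction when predicting the reward from its own past plus
    the agent's lagged observations (full model), versus its own past alone
    (restricted model). Larger reduction suggests a stronger causal link."""
    T = len(rewards)
    y = rewards[lag:]                          # targets r_t
    past_r = rewards[:-lag].reshape(-1, 1)     # reward's own history
    past_obs = obs_history[:-lag]              # agent's lagged local observations
    restricted = np.hstack([np.ones((T - lag, 1)), past_r])
    full = np.hstack([restricted, past_obs])
    # Least-squares residuals of both linear models.
    res_r = y - restricted @ np.linalg.lstsq(restricted, y, rcond=None)[0]
    res_f = y - full @ np.linalg.lstsq(full, y, rcond=None)[0]
    var_r, var_f = res_r.var(), res_f.var()
    return max(0.0, 1.0 - var_f / var_r) if var_r > 0 else 0.0

def causal_credit(team_reward, scores):
    """Split the team reward in proportion to each agent's causality score,
    penalising 'lazy' agents whose observations do not help cause the reward."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    weights = scores / scores.sum() if scores.sum() > 0 else np.full(n, 1.0 / n)
    return weights * team_reward

# Usage: three agents with 4-dim local observations; the team reward is driven
# by agent 0's previous observation, so agent 0 should receive most credit.
rng = np.random.default_rng(0)
obs = [rng.normal(size=(100, 4)) for _ in range(3)]
rewards = np.roll(obs[0][:, 0], 1) + 0.1 * rng.normal(size=100)
scores = [granger_score(o, rewards) for o in obs]
print(causal_credit(team_reward=1.0, scores=scores))
```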