A reliable critic is central to on-policy actor-critic learning. But learning a reliable critic becomes challenging in a multi-agent sparse-reward scenario due to two factors: 1) the joint action space grows exponentially with the number of agents, and 2) this, combined with reward sparsity and environment noise, leads to large sample requirements for accurate learning. We show that regularising the critic with spectral normalisation (SN) enables it to learn more robustly, even in multi-agent on-policy sparse-reward scenarios. Our experiments show that the regularised critic quickly learns from the sparse rewarding experience in the complex SMAC and RWARE domains. These findings highlight the importance of critic regularisation for stable learning.
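The abstract does not include an implementation, but a minimal sketch of the core idea, wrapping a centralised critic's linear layers in PyTorch's spectral normalisation, might look like the following. The network depth, hidden size, and input construction are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


class SNCritic(nn.Module):
    """Centralised state-value critic with spectral normalisation (sketch).

    Spectral normalisation rescales each weight matrix by its largest singular
    value (estimated via power iteration), bounding the layer's Lipschitz
    constant and keeping value estimates stable under sparse, noisy
    multi-agent returns.
    """

    def __init__(self, input_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Wrapping nn.Linear with spectral_norm registers a hook that
        # normalises the weight before every forward pass.
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(input_dim, hidden_dim)),
            nn.ReLU(),
            spectral_norm(nn.Linear(hidden_dim, hidden_dim)),
            nn.ReLU(),
            spectral_norm(nn.Linear(hidden_dim, 1)),
        )

    def forward(self, central_state: torch.Tensor) -> torch.Tensor:
        # central_state: (batch, input_dim), e.g. a concatenation of agent
        # observations or the global state, depending on the algorithm.
        return self.net(central_state)


# Quick shape check with hypothetical dimensions: 3 agents, 32-dim observations.
critic = SNCritic(input_dim=3 * 32)
values = critic(torch.randn(8, 3 * 32))
print(values.shape)  # torch.Size([8, 1])
```

This critic could plug into any on-policy multi-agent actor-critic loop (e.g. a MAPPO-style trainer); the actor networks and the rest of the pipeline are left unchanged, since the regularisation applies only to the critic's weights.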