Constrained multiagent reinforcement learning (C-MARL) is gaining importance as MARL algorithms find new applications in real-world systems ranging from energy systems to drone swarms. Most C-MARL algorithms use a primal-dual approach to enforce constraints through a penalty function added to the reward. In this paper, we study the structural effects of this penalty term on the MARL problem. First, we show that the standard practice of using the constraint function as the penalty leads to a weak notion of safety. However, by making simple modifications to the penalty term, we can enforce meaningful probabilistic (chance and conditional value-at-risk) constraints. Second, we quantify the effect of the penalty term on the value function, uncovering an improved value estimation procedure. We use these insights to propose a constrained multiagent advantage actor-critic (C-MAA2C) algorithm. Simulations in a simple constrained multiagent environment affirm that our reinterpretation of the primal-dual method in terms of probabilistic constraints is effective, and that our proposed value estimate accelerates convergence to a safe joint policy.
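For context, a minimal sketch of the primal-dual penalty construction the abstract refers to, written in generic constrained-MDP notation (our symbols, not necessarily the paper's exact formulation): with reward \(r\), constraint (cost) function \(c\), budget \(d\), discount \(\gamma\), and dual variable \(\lambda\), the Lagrangian problem is

\[
\min_{\lambda \ge 0}\;\max_{\pi}\;
\mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, r(s_t, a_t)\right]
\;-\; \lambda \left( \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, c(s_t, a_t)\right] - d \right),
\]

so for a fixed \(\lambda\) the agents effectively optimize the penalized reward \(\tilde{r}_t = r(s_t, a_t) - \lambda\, c(s_t, a_t)\); using \(c\) directly as the penalty in this way is the standard practice whose structural consequences the paper examines.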