Constrained multiagent reinforcement learning (C-MARL) is gaining importance as MARL algorithms find new applications in real-world systems ranging from energy systems to drone swarms. Most C-MARL algorithms use a primal-dual approach to enforce constraints through a penalty function added to the reward. In this paper, we study the structural effects of the primal-dual approach on the constraints and value function. First, we show that using the constraint evaluation as the penalty leads to a weak notion of safety, but by making simple modifications to the penalty function, we can enforce meaningful probabilistic safety constraints. Second, we exploit the structural effects of primal-dual methods on value functions, leading to improved value estimates. Simulations in a simple constrained multiagent environment show that our reinterpretation of the primal-dual method in terms of probabilistic constraints is meaningful, and that our proposed value estimation procedure improves convergence to a safe joint policy.
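For context, a minimal sketch of the standard primal-dual relaxation the abstract refers to (notation here is illustrative, not necessarily the paper's): the constrained objective $\max_{\pi} J_r(\pi)$ subject to $J_c(\pi) \le d$ is relaxed to the Lagrangian
\[
\max_{\pi}\,\min_{\lambda \ge 0}\; \mathcal{L}(\pi,\lambda) \;=\; J_r(\pi) \;-\; \lambda\,\bigl(J_c(\pi) - d\bigr),
\]
where $J_r$ is the expected return, $J_c$ the expected constraint cost, and $d$ the constraint threshold. In practice, the dual variable $\lambda$ is ascended when the constraint is violated while the policy $\pi$ is trained on the penalized reward $r - \lambda c$, which is the "penalty function added to the reward" discussed above.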