Multi-agent control problems constitute an interesting area of application for deep reinforcement learning models with continuous action spaces. Such real-world applications, however, typically come with critical safety constraints that must not be violated. To ensure safety, we enhance the well-known multi-agent deep deterministic policy gradient (MADDPG) framework by adding to the deep policy network a safety layer that automatically corrects invalid actions. In particular, we extend the idea of linearizing the single-step transition dynamics, introduced for single-agent systems in Safe DDPG (Dalal et al., 2018), to multi-agent settings. We additionally propose to circumvent infeasibility problems in the action correction step using soft constraints (Kerrigan & Maciejowski, 2000). Results from the theory of exact penalty functions guarantee satisfaction of the soft constraints under mild assumptions. We empirically find that the soft formulation achieves a dramatic decrease in constraint violations, making safety available even during the learning procedure.
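To make the action-correction step concrete, the following is a minimal sketch of one soft safety-layer projection, under the assumptions stated in the abstract: the single-step constraint dynamics are linearized as in Safe DDPG, and the hard constraints are replaced by an L1 exact penalty so the problem remains solvable even when the hard constraints are infeasible. All names (`soft_safety_layer`, `a_pi`, `G`, `c`, `C`, `rho`) are illustrative choices, not identifiers from the paper, and the toy numbers are invented for demonstration only.

```python
import numpy as np
import cvxpy as cp

def soft_safety_layer(a_pi, G, c, C, rho=100.0):
    """Project a raw policy action onto the (softened) safe set.

    Assumes linearized single-step constraint dynamics, as in
    Dalal et al. (2018): for each constraint j,
        c_j(s') ~= c_j(s) + g_j(s)^T a,
    where the rows of G are the learned sensitivity vectors g_j(s).
    The hard constraints c_j(s') <= C_j are softened with an L1
    exact penalty of weight rho; exact-penalty theory says that for
    rho large enough the hard solution is recovered whenever it is
    feasible, while the soft problem stays solvable otherwise.
    """
    a = cp.Variable(a_pi.shape[0])
    violation = cp.pos(G @ a + c - C)  # per-constraint violation, max(., 0)
    objective = cp.Minimize(cp.sum_squares(a - a_pi) + rho * cp.sum(violation))
    cp.Problem(objective).solve()
    return a.value

# Toy usage: 2-D action, one constraint that the raw action violates.
a_pi = np.array([1.0, 0.5])   # raw action proposed by the MADDPG policy
G = np.array([[1.0, 0.0]])    # learned constraint sensitivity g(s)
c = np.array([0.8])           # current constraint value c(s)
C = np.array([1.0])           # safety threshold
print(soft_safety_layer(a_pi, G, c, C))  # first component is pulled back
```

In the multi-agent setting described above, one such projection would sit on top of each agent's deterministic policy output, so the correction is applied before the joint action reaches the environment, even during training.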