Multi-agent control problems constitute an interesting application area for deep reinforcement learning models with continuous action spaces. Such real-world applications, however, typically come with critical safety constraints that must not be violated. To ensure safety, we enhance the well-known multi-agent deep deterministic policy gradient (MADDPG) framework by adding a safety layer on top of the deep policy network. In particular, we extend the idea of linearizing the single-step transition dynamics, introduced for single-agent systems in Safe DDPG (Dalal et al., 2018), to multi-agent settings. We additionally propose to circumvent infeasibility problems in the action correction step using soft constraints (Kerrigan & Maciejowski, 2000). Results from the theory of exact penalty functions guarantee, under mild assumptions, that the soft formulation still satisfies the original hard constraints. We empirically find that the soft formulation yields a dramatic decrease in constraint violations, providing safety even during the learning procedure.
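To make the action correction step concrete, the following is a minimal sketch of a soft-constrained safety layer of the kind the abstract describes. It is not the authors' implementation: the function name `safe_action`, the penalty weight `rho`, and the use of cvxpy are illustrative assumptions. Following the Safe DDPG idea, the safety signals are modeled as linear in the action over a single step, and the proposed action is projected onto the (softened) safe set by solving a small QP with l1-penalized slack variables:

```python
# Hypothetical sketch of a soft-constrained safety layer (per-agent).
# Solves:  min_{a', xi}  ||a' - a||^2 + rho * sum(xi)
#          s.t.          c(s) + G(s) a' <= C + xi,   xi >= 0
# The linear model G(s) is assumed to be learned, as in Dalal et al. (2018).
import numpy as np
import cvxpy as cp

def safe_action(a, c, G, C, rho=100.0):
    """Project a proposed action onto the linearized (soft) safe set.

    a:   (d,) action proposed by the (MA)DDPG policy
    c:   (m,) current values of the m safety signals
    G:   (m, d) learned sensitivities of the signals w.r.t. the action
    C:   (m,) safety thresholds
    rho: l1 penalty weight on the slacks (exact penalty parameter)
    """
    d, m = a.shape[0], c.shape[0]
    a_new = cp.Variable(d)
    xi = cp.Variable(m, nonneg=True)  # slacks keep the QP always feasible
    objective = cp.Minimize(cp.sum_squares(a_new - a) + rho * cp.sum(xi))
    constraints = [c + G @ a_new <= C + xi]
    cp.Problem(objective, constraints).solve()
    return a_new.value, xi.value

# Illustrative usage: two safety constraints on a 3-dimensional action.
a_proposed = np.array([0.8, -0.3, 0.5])
c = np.array([0.9, 0.2])           # current safety-signal values
G = 0.1 * np.random.randn(2, 3)    # stand-in for the learned linear model
C = np.array([1.0, 1.0])           # constraint thresholds
a_safe, slack = safe_action(a_proposed, c, G, C)
```

The slack variables are what makes the formulation "soft": the QP remains feasible even when no action satisfies all constraints exactly. By the exact penalty results cited above, choosing `rho` sufficiently large ensures the slacks stay at zero whenever the hard-constrained projection is feasible, so softening does not sacrifice safety to gain feasibility.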