Deploying reinforcement learning (RL) agents in real-world applications often requires satisfying complex system constraints. Because such systems are complex, or because thresholds cannot be verified offline (e.g., no simulator or reasonable offline evaluation procedure exists), the constraint thresholds are often set incorrectly, yielding problems in which the task cannot be solved without violating the constraints. In many real-world cases, however, constraint violations are undesirable but not catastrophic, motivating soft-constrained RL approaches. We present a soft-constrained RL approach that uses meta-gradients to find a good trade-off between maximizing expected return and minimizing constraint violations. We demonstrate the effectiveness of this approach by showing that it consistently outperforms the baselines across four different MuJoCo domains.
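The abstract does not spell out the algorithm, but the core idea of adapting a constraint penalty via meta-gradients can be illustrated on a toy one-dimensional problem. Everything below is an illustrative assumption rather than the paper's method: a hypothetical quadratic return, a softplus violation cost, a Lagrangian-style penalized inner objective, and an outer objective whose gradient is chained through one inner update step to adapt the penalty coefficient.

```python
import math

def tune_penalty(alpha=0.1, eta=0.5, kappa=4.0, steps=3000):
    """Toy meta-gradient tuning of a soft-constraint penalty coefficient.

    Hypothetical 1-D setup (not from the paper):
      return  r(a) = -(a - 2)^2        (maximized at a = 2)
      cost    c(a) = softplus(a - 1)   (smooth violation once a > 1)
    Inner step: gradient ascent on the penalized return r(a) - lam * c(a).
    Outer step: meta-gradient on lam, differentiating the outer objective
    L = -r(a') + kappa * c(a') through the inner update a -> a'.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    a, lam = 0.0, 0.0
    for _ in range(steps):
        grad_r = -2.0 * (a - 2.0)        # dr/da
        grad_c = sigmoid(a - 1.0)        # dc/da (softplus derivative)
        a_new = a + alpha * (grad_r - lam * grad_c)  # inner policy step

        # Outer-loss gradient w.r.t. the post-update parameter a':
        dL_da = 2.0 * (a_new - 2.0) + kappa * sigmoid(a_new - 1.0)
        # Chain rule through the inner update: da'/dlam = -alpha * grad_c.
        dL_dlam = dL_da * (-alpha * grad_c)
        lam = max(0.0, lam - eta * dL_dlam)          # meta-gradient step
        a = a_new
    return a, lam
```

With these (assumed) objectives the dynamics settle near the constraint boundary: while the post-update parameter violates the constraint the meta-gradient raises `lam`, and while it is safely inside the meta-gradient lowers `lam`, trading off return against violations rather than enforcing the constraint as hard.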