In safety-critical domains such as robotics, navigation, and power systems, constrained optimization problems arise in which maximizing performance must be carefully balanced against satisfying constraints. Safe reinforcement learning provides a framework for addressing these challenges, with Lagrangian methods being a popular choice. However, the effectiveness of Lagrangian methods depends crucially on the choice of the Lagrange multiplier $\lambda$, which governs the trade-off between return and constraint cost. A common approach is to update the multiplier automatically during training. Although this is standard in practice, there remains limited empirical evidence on the robustness of automated updates and their influence on overall performance. We therefore analyze (i) the optimality and (ii) the stability of Lagrange multipliers in safe reinforcement learning across a range of tasks. We provide $\lambda$-profiles that give a complete visualization of the trade-off between return and constraint cost in the optimization problem. These profiles show how sensitive performance is to $\lambda$ and confirm the lack of general intuition for choosing the optimal value $\lambda^*$. Our findings further show that automated multiplier updates can recover, and sometimes even exceed, the optimal performance found at $\lambda^*$, owing to the vast difference in their learning trajectories. Furthermore, we show that automated multiplier updates exhibit oscillatory behavior during training, which can be mitigated through PID-controlled updates; however, this method requires careful tuning to achieve consistently better performance across tasks. This highlights the need for further research on stabilizing Lagrangian methods in safe reinforcement learning. The code used to reproduce our results can be found at https://github.com/lindsayspoor/Lagrangian_SafeRL.
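As a rough illustration of the two multiplier-update schemes compared above, the sketch below contrasts a plain gradient-ascent update of $\lambda$ with a PID-controlled variant. It is a minimal, self-contained sketch rather than the released code: the class names, gain values, and the episodic-cost interface (`ep_cost`, `cost_limit`) are hypothetical choices for illustration only.

```python
# Minimal sketch (illustrative only) of automated vs. PID-controlled
# Lagrange multiplier updates in safe RL.

class LagrangeMultiplier:
    """Plain gradient-ascent update: lambda <- max(0, lambda + lr * (J_C - d))."""

    def __init__(self, lr: float = 0.05, init_lambda: float = 0.0):
        self.lr = lr
        self.lam = init_lambda

    def update(self, ep_cost: float, cost_limit: float) -> float:
        violation = ep_cost - cost_limit  # positive when the constraint is violated
        self.lam = max(0.0, self.lam + self.lr * violation)
        return self.lam


class PIDLagrangeMultiplier:
    """PID-controlled update: the integral term mirrors the plain scheme, while
    the proportional and derivative terms damp oscillations in lambda."""

    def __init__(self, kp: float = 0.1, ki: float = 0.01, kd: float = 0.01):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_cost = 0.0
        self.lam = 0.0

    def update(self, ep_cost: float, cost_limit: float) -> float:
        violation = ep_cost - cost_limit
        self.integral = max(0.0, self.integral + violation)  # accumulated violation
        derivative = max(0.0, ep_cost - self.prev_cost)      # penalize rising cost only
        self.prev_cost = ep_cost
        self.lam = max(0.0, self.kp * violation
                            + self.ki * self.integral
                            + self.kd * derivative)
        return self.lam


# In either scheme, the resulting multiplier weights the constraint cost in the
# policy objective, e.g. maximizing J_R(pi) - lam * J_C(pi) during training.
```

The usage is identical for both classes: after each rollout, `update(ep_cost, cost_limit)` is called with the measured episodic constraint cost and the cost threshold, and the returned $\lambda$ is used to penalize the cost term in the next policy update.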