Deploying Reinforcement Learning (RL) agents in the real world requires that the agents satisfy safety constraints. Current RL agents explore the environment without considering these constraints, which can lead to damage to the hardware or even to other agents in the environment. We propose a new method, LBPO, that uses a Lyapunov-based barrier function to restrict the policy update to a safe set at each training iteration. Our method also allows the user to control how conservative the agent is with respect to the constraints in the environment. LBPO significantly outperforms state-of-the-art baselines in terms of the number of constraint violations during training while being competitive in terms of performance. Further, our analysis reveals that baselines such as CPO and SDDPG rely mostly on backtracking to ensure safety rather than on safe projection, which provides insight into why previous methods might not have effectively limited the number of constraint violations.
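To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of how a Lyapunov-style barrier can be folded into a policy-gradient objective: the reward surrogate is penalized by a log-barrier on an estimated cost (Lyapunov) advantage so that gradient steps are discouraged from leaving the estimated safe set. The names `cost_advantage`, `safety_margin`, and `barrier_coef` are illustrative assumptions, not quantities defined in the paper.

```python
# Illustrative sketch of a barrier-penalized policy objective, assuming a
# Gaussian policy and precomputed reward/cost advantage estimates.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Tiny Gaussian policy, for illustration only."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def log_prob(self, obs, act):
        dist = torch.distributions.Normal(self.mean(obs), self.log_std.exp())
        return dist.log_prob(act).sum(-1)

def barrier_policy_loss(policy, obs, act, old_log_prob, reward_adv,
                        cost_advantage, safety_margin, barrier_coef=1.0):
    """Maximize the reward surrogate while a log-barrier keeps the expected
    cost (Lyapunov) advantage below the per-update safety margin."""
    ratio = torch.exp(policy.log_prob(obs, act) - old_log_prob)
    reward_term = (ratio * reward_adv).mean()
    # Estimated change in the constraint/Lyapunov value under the new policy.
    cost_term = (ratio * cost_advantage).mean()
    # Barrier grows without bound as the update approaches the boundary of the
    # safe set (cost_term -> safety_margin); clamp keeps the log argument positive.
    barrier = -torch.log(torch.clamp(safety_margin - cost_term, min=1e-6))
    return -(reward_term - barrier_coef * barrier)

# Usage with random data, purely to show the shapes involved.
torch.manual_seed(0)
policy = GaussianPolicy(obs_dim=4, act_dim=2)
obs, act = torch.randn(64, 4), torch.randn(64, 2)
with torch.no_grad():
    old_log_prob = policy.log_prob(obs, act)
loss = barrier_policy_loss(policy, obs, act, old_log_prob,
                           reward_adv=torch.randn(64),
                           cost_advantage=0.1 * torch.randn(64),
                           safety_margin=torch.tensor(0.5))
loss.backward()  # gradients trade off return against staying inside the barrier
```

Tuning `barrier_coef` (or the margin) is one way to expose the kind of user-controlled conservativeness described above; a larger coefficient keeps updates further from the constraint boundary at some cost in return.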