Reinforcement learning (RL) has achieved promising results on many robotic control tasks. Safety of learning-based controllers is essential to ensuring their effectiveness. Current methods impose the same tight constraints throughout training, resulting in inefficient exploration in the early stage. In this paper, we propose Constrained Policy Optimization with Extra Safety Budget (ESB-CPO), an algorithm that strikes a balance between exploration and constraint satisfaction. In the early stage, our method loosens the practical constraints on unsafe transitions by adding an extra safety budget, guided by a new metric we propose. As training proceeds, the constraints in our optimization problem become tighter. Theoretical analysis and practical experiments demonstrate that our method gradually meets the cost limit in the final training stage. When evaluated on the Safety-Gym and Bullet-Safety-Gym benchmarks, our method shows advantages over baseline algorithms in terms of both safety and optimality. Notably, our method achieves a significant performance improvement over the CPO algorithm under the same cost limit.