Safety is essential for reinforcement learning (RL) applied to real-world tasks such as autonomous driving. Chance constraints, which guarantee that state constraints are satisfied with high probability, are suitable for representing the requirements of real-world environments with uncertainty. Existing chance constrained RL methods, such as the penalty method and the Lagrangian method, either exhibit periodic oscillations or fail to satisfy the constraints. In this paper, we address these shortcomings by proposing a separated proportional-integral Lagrangian (SPIL) algorithm. Taking a control perspective, we first interpret the penalty method and the Lagrangian method as proportional feedback and integral feedback control, respectively. Then, a proportional-integral Lagrangian method is proposed to stabilize the learning process while improving safety. To prevent integral overshooting and reduce conservatism, we introduce an integral separation technique inspired by PID control. Finally, an analytical gradient of the chance constraint is utilized for model-based policy optimization. The effectiveness of SPIL is demonstrated on a narrow car-following task. Experiments indicate that, compared with previous methods, SPIL improves performance while guaranteeing safety, with a stable learning process.
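To make the control-theoretic interpretation concrete, the sketch below shows one possible proportional-integral multiplier update with integral separation, assuming the multiplier is driven by the chance-constraint violation error. The function name, gain values, and separation threshold are illustrative assumptions, not the paper's exact formulation.

```python
def update_multiplier(safe_prob, required_prob, integral,
                      kp=2.0, ki=0.5, sep_bound=0.2):
    """One PI update of the Lagrange multiplier with integral separation (illustrative).

    safe_prob:     estimated probability that the state constraint holds
    required_prob: required satisfaction probability of the chance constraint
    integral:      running accumulation of past violation errors
    kp, ki:        proportional / integral gains (hypothetical values)
    sep_bound:     error threshold beyond which the integral is frozen
                   (the "separation" intended to prevent integral overshoot)
    """
    # Positive error means the chance constraint is currently violated.
    error = required_prob - safe_prob

    # Integral separation: accumulate only when the error is small,
    # so a large transient violation does not wind up the integral term.
    if abs(error) <= sep_bound:
        integral = max(0.0, integral + error)

    # Proportional term (penalty-like) plus integral term (Lagrangian-like);
    # the multiplier is kept nonnegative.
    lam = max(0.0, kp * error + ki * integral)
    return lam, integral
```

In such a scheme, the returned multiplier would weight the constraint term against the expected return in the policy-optimization objective, with the proportional part reacting to the current violation and the integral part removing steady-state violation.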