Safety is essential for reinforcement learning (RL) applied in the real world. Adding chance constraints (or probabilistic constraints) is a suitable way to enhance RL safety under uncertainty. Existing chance-constrained RL methods, such as the penalty method and the Lagrangian method, either exhibit periodic oscillations or learn a policy that is over-conservative or unsafe. In this paper, we address these shortcomings by proposing a separated proportional-integral Lagrangian (SPIL) algorithm. We first review the constrained policy optimization process from a feedback control perspective, which regards the penalty weight as the control input and the safe probability as the control output. From this viewpoint, the penalty method is formulated as a proportional controller, and the Lagrangian method as an integral controller. We then unify the two into a proportional-integral Lagrangian method that combines their merits, and introduce an integral separation technique to keep the integral term within a reasonable range. To accelerate training, the gradient of the safe probability is computed in a model-based manner. We demonstrate that our method reduces the oscillations and conservatism of the learned policy in a car-following simulation. To demonstrate its practicality, we also apply the method to a real-world mobile robot navigation task, in which the robot successfully avoids a moving obstacle with highly uncertain, even aggressive, behavior.
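To make the feedback-control view concrete, the following is a minimal sketch of how a penalty weight could be updated by a proportional-integral rule with integral separation. The gains K_P and K_I, the separation threshold, and the helper name pi_penalty_weight are illustrative assumptions for this sketch, not values or interfaces from the paper.

```python
# Minimal sketch: PI update of the penalty weight with integral separation.
# All constants below are assumed for illustration only.
K_P, K_I = 0.5, 0.1          # proportional and integral gains (assumed)
SEPARATION_THRESHOLD = 0.2   # skip integration for large violations (assumed)
integral = 0.0               # accumulated constraint violation

def pi_penalty_weight(safe_prob: float, required_prob: float) -> float:
    """Return the penalty weight for the next policy update."""
    global integral
    # Control error: how far the measured safe probability falls short
    # of the level required by the chance constraint.
    error = required_prob - safe_prob
    # Integral separation: accumulate the error only when it is small,
    # so a large transient violation does not cause integral windup.
    if abs(error) < SEPARATION_THRESHOLD:
        integral = max(0.0, integral + error)
    # Penalty weight combines proportional and integral terms (kept nonnegative).
    return max(0.0, K_P * error + K_I * integral)

# Example: the current policy is safe with probability 0.92, while the
# chance constraint requires at least 0.99.
weight = pi_penalty_weight(safe_prob=0.92, required_prob=0.99)
```

Setting K_I to zero recovers a purely proportional (penalty-style) update, while setting K_P to zero recovers a purely integral (Lagrangian-style) update, which is the sense in which the PI rule unifies the two.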