Safety is essential for reinforcement learning (RL) applied in real-world situations. Chance constraints are suitable for representing safety requirements in stochastic systems. Previous chance-constrained RL methods usually suffer from a low convergence rate, or only learn a conservative policy. In this paper, we propose a model-based chance-constrained actor-critic (CCAC) algorithm that can efficiently learn a safe and non-conservative policy. Different from existing methods that optimize a conservative lower bound, CCAC directly solves the original chance-constrained problem, in which the objective function and the safe probability are optimized simultaneously with adaptive weights. To improve the convergence rate, CCAC utilizes the gradient of the dynamics model to accelerate policy optimization. The effectiveness of CCAC is demonstrated on a stochastic car-following task. Experiments indicate that, compared with previous RL methods, CCAC improves performance while guaranteeing safety, with a five times faster convergence rate. It also achieves 100 times higher online computation efficiency than traditional safety techniques such as stochastic model predictive control.
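For reference, the chance-constrained problem alluded to above can be stated in a standard form (the notation here is illustrative and may differ from the paper's own):

\[
\max_{\pi} \ \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\Pr_{\tau \sim \pi}\!\bigl(s_t \in \mathcal{S}_{\mathrm{safe}},\ \forall t \le T\bigr) \ge 1 - \delta,
\]

where \(\pi\) is the policy, \(r\) the reward, \(\mathcal{S}_{\mathrm{safe}}\) the set of safe states, and \(1-\delta\) the required safe probability. Rather than replacing the probabilistic constraint with a conservative surrogate bound, the approach described above optimizes the objective and the safe probability jointly with adaptive weights.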