Safe Reinforcement Learning using Data-Driven Predictive Control

Reinforcement learning (RL) algorithms can achieve state-of-the-art performance in decision-making and continuous control tasks. However, applying RL algorithms to safety-critical systems still needs to be well justified because many RL algorithms learn by exploration, especially when the models of the robot and the environment are unknown. To address this challenge, we propose a data-driven safety layer that acts as a filter for unsafe actions. The safety layer uses a data-driven predictive controller to enforce safety guarantees for RL policies during training and after deployment. The RL agent proposes an action, which is verified using data-driven reachability analysis. If the reachable set of the robot under the proposed action intersects an unsafe set (for example, an obstacle), we call the data-driven predictive controller to find the closest safe action to the proposed unsafe action. The safety layer then penalizes the RL agent and replaces the unsafe action with the closest safe one. In simulation, we show that our method outperforms state-of-the-art safe RL methods on a robot navigation problem for a Turtlebot 3 in Gazebo and a quadrotor in Unreal Engine 4 (UE4).
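
The filtering loop described in the abstract (propose an action, check the reachable set, fall back to the closest safe action, penalize the agent) can be illustrated with a toy sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration. It uses a known single-integrator model with a bounded disturbance instead of a model learned from data, an axis-aligned box instead of a data-driven reachable set, and a brute-force grid search instead of a predictive controller. The names `reachable_box`, `boxes_intersect`, and `safety_filter`, and the parameters `dt`, `w`, `u_max`, `n_grid`, and `penalty`, are hypothetical.

```python
import itertools
import math


def reachable_box(x, u, dt=0.1, w=0.05):
    """One-step box over-approximation for the ASSUMED toy model
    x_next = x + dt * u + d, where every component of d lies in [-w, w]."""
    center = [xi + dt * ui for xi, ui in zip(x, u)]
    lo = [c - w for c in center]
    hi = [c + w for c in center]
    return lo, hi


def boxes_intersect(lo_a, hi_a, lo_b, hi_b):
    """Two axis-aligned boxes intersect iff their intervals overlap on every axis."""
    return all(la <= hb and lb <= ha
               for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b))


def safety_filter(x, u_proposed, obstacle, u_max=1.0, n_grid=21, penalty=-1.0):
    """Return (action, reward_penalty).

    A safe proposal passes through unchanged with zero penalty. An unsafe
    proposal is replaced by the closest safe action found on a grid, and the
    (hypothetical) penalty is returned for the RL agent's reward.
    """
    obs_lo, obs_hi = obstacle

    def is_safe(u):
        lo, hi = reachable_box(x, u)
        return not boxes_intersect(lo, hi, obs_lo, obs_hi)

    if is_safe(u_proposed):
        return tuple(u_proposed), 0.0

    # Brute-force stand-in for the predictive controller: enumerate a grid of
    # candidate actions and keep the safe one nearest to the proposal.
    axis = [-u_max + 2.0 * u_max * i / (n_grid - 1) for i in range(n_grid)]
    safe = [u for u in itertools.product(axis, repeat=len(x)) if is_safe(u)]
    if not safe:
        raise RuntimeError("no safe action found on the candidate grid")
    best = min(safe, key=lambda u: math.dist(u, u_proposed))
    return best, penalty
```

For example, calling `safety_filter((0.0, 0.0), (1.0, 0.0), ((0.085, -0.1), (0.3, 0.1)))` rejects the proposed action, because its reachable box overlaps the obstacle box, and returns a slower action in the same direction together with the penalty. A one-step box check is much weaker than the multi-step reachability guarantees the abstract refers to, so the sketch only conveys the control flow of the safety layer.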
