In model-based reinforcement learning for safety-critical control systems, it is important to formally certify system properties (e.g., safety, stability) under the learned controller. However, because existing methods typically apply formal verification \emph{after} the controller has been learned, it can be difficult to obtain any certificate, even after many iterations between learning and verification. To address this challenge, we propose a framework that jointly conducts reinforcement learning and formal verification by formulating and solving a novel bilevel optimization problem, which is made differentiable via gradients from both the value function and the certificates. Experiments on a variety of examples demonstrate the significant advantages of our framework over the model-based stochastic value gradient (SVG) method and the model-free proximal policy optimization (PPO) method in finding feasible controllers with barrier functions and Lyapunov functions that certify system safety and stability.
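To make the structure concrete, the following is a minimal sketch of such a bilevel program in illustrative notation (the symbols $\theta$, $\phi$, $V^{\pi_\theta}$, and $\mathcal{L}_{\text{cert}}$ are our placeholders here, not the paper's definitions): the upper level maximizes the expected return of the controller $\pi_\theta$, while the lower level synthesizes certificate parameters $\phi$ (e.g., of a barrier or Lyapunov candidate) by minimizing an aggregate measure of certificate-condition violations under the learned dynamics; the certificate is valid when that measure is driven to zero, and gradients from both levels flow back to $\theta$.
\begin{align*}
\max_{\theta}\;& \mathbb{E}\!\left[\, V^{\pi_\theta} \,\right] \\
\text{s.t.}\;& \phi^{*}(\theta) \in \arg\min_{\phi}\; \mathcal{L}_{\text{cert}}(\theta,\phi),
\qquad \mathcal{L}_{\text{cert}}\!\left(\theta,\phi^{*}(\theta)\right) \le 0 .
\end{align*}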