In model-based reinforcement learning for safety-critical control systems, it is important to formally certify system properties (e.g., safety, stability) under the learned controller. However, because existing methods typically apply formal verification only \emph{after} the controller has been learned, it can be difficult to obtain any certificate at all, even after many iterations between learning and verification. To address this challenge, we propose a framework that jointly conducts reinforcement learning and formal verification by formulating and solving a novel bilevel optimization problem, which is differentiable through the gradients of both the value function and the certificates. Experiments on a variety of examples demonstrate the significant advantages of our framework over the model-based stochastic value gradient (SVG) method and the model-free proximal policy optimization (PPO) method in finding feasible controllers with barrier functions and Lyapunov functions that ensure system safety and stability.
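As a rough illustration of what such a joint formulation might look like, one plausible bilevel structure is sketched below; the notation, objective, and certificate loss are illustrative assumptions, not the exact formulation used in this work:
\[
\max_{\theta} \; J(\pi_\theta)
\quad \text{s.t.} \quad
\phi^\star \in \arg\min_{\phi} \; \mathcal{L}_{\mathrm{cert}}(\phi; \pi_\theta),
\qquad
\mathcal{L}_{\mathrm{cert}}(\phi^\star; \pi_\theta) = 0,
\]
where $J(\pi_\theta)$ is the expected return of the policy $\pi_\theta$, and $\mathcal{L}_{\mathrm{cert}}(\phi; \pi_\theta)$ penalizes violations of the barrier-function conditions (sign conditions on the initial and unsafe sets, plus an invariance condition along closed-loop trajectories under $\pi_\theta$) and the Lyapunov conditions (positive definiteness and decrease along trajectories) for certificate parameters $\phi$. Under this reading, the outer policy update can differentiate both $J$ and $\mathcal{L}_{\mathrm{cert}}$ with respect to $\theta$, so the controller is steered toward certifiable behavior during learning rather than verified only afterward.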