We study the problem of learning controllers for discrete-time non-linear stochastic dynamical systems with formal reach-avoid guarantees. This work presents the first method for providing formal reach-avoid guarantees, which combine and generalize stability and safety guarantees, with a tolerable probability threshold $p\in[0,1]$ over the infinite time horizon. Our method leverages advances in the machine learning literature and represents formal certificates as neural networks. In particular, we learn a certificate in the form of a reach-avoid supermartingale (RASM), a novel notion that we introduce in this work. Our RASMs provide reachability and avoidance guarantees by imposing constraints on what can be viewed as a stochastic extension of level sets of Lyapunov functions for deterministic systems. Our approach solves several important problems: it can be used to learn a control policy from scratch, to verify whether a fixed control policy satisfies a reach-avoid specification, or to fine-tune a pre-trained policy when it does not satisfy the specification. We validate our approach on $3$ stochastic non-linear reinforcement learning tasks.
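To give a flavor of the certificate conditions, the following is a minimal sketch of the constraints an RASM imposes; the notation ($f$ for the dynamics, $\pi$ for the policy, $\omega$ for the stochastic disturbance, and $\mathcal{X}_0$, $\mathcal{X}_t$, $\mathcal{X}_u$ for the initial, target, and unsafe sets) and the exact threshold $\frac{1}{1-p}$ are illustrative assumptions based on the standard supermartingale argument, not a verbatim restatement of the definition in the paper. Under these assumptions, a nonnegative function $V$ certifies the reach-avoid property with probability at least $p$ if, for some $\varepsilon > 0$,
\begin{align*}
V(x) &\le 1 && \text{for all } x \in \mathcal{X}_0,\\
V(x) &\ge \tfrac{1}{1-p} && \text{for all } x \in \mathcal{X}_u,\\
\mathbb{E}_{\omega}\!\left[ V\big(f(x, \pi(x), \omega)\big) \right] &\le V(x) - \varepsilon && \text{for all } x \notin \mathcal{X}_t \text{ with } V(x) \le \tfrac{1}{1-p}.
\end{align*}
Intuitively, Ville's maximal inequality for supermartingales bounds the probability that $V$ ever reaches the level $\frac{1}{1-p}$ (and hence that the system enters $\mathcal{X}_u$) by $1-p$, while the strict expected decrease by $\varepsilon$ rules out remaining outside $\mathcal{X}_t$ forever, together yielding the reach-avoid guarantee with probability at least $p$.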