Reinforcement Learning (RL) and continuous nonlinear control have been successfully deployed in many domains involving complicated sequential decision-making tasks. However, given the exploratory nature of the learning process and the presence of model uncertainty, applying them to safety-critical control tasks remains challenging due to the lack of safety guarantees. On the other hand, while combining control-theoretical approaches with learning algorithms has shown promise in safe RL applications, the sample efficiency of the safe data-collection process for control is not well addressed. In this paper, we propose a \emph{provably} sample-efficient episodic safe learning framework for online control tasks that leverages safe exploration and exploitation in an unknown, nonlinear dynamical system. In particular, the framework 1) extends control barrier functions (CBFs) to a stochastic setting to achieve provable high-probability safety under uncertainty during model learning, and 2) integrates an optimism-based exploration strategy to efficiently guide the safe exploration process with learned dynamics toward \emph{near-optimal} control performance. We provide a formal analysis of the episodic regret bound against the optimal controller and of the probabilistic safety guarantees. Simulation results are provided to demonstrate the effectiveness and efficiency of the proposed algorithm.
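As background for the CBF-based safety mechanism, the sketch below recalls the standard control barrier function condition and one common high-probability tightening under learned dynamics. The notation here ($h$, $\alpha$, $\hat f$, $\hat g$, and the error bound $\epsilon_\delta$) is illustrative and may differ from the paper's exact stochastic formulation.
% Illustrative sketch (not the paper's exact formulation).
% Control-affine dynamics and safe set:
%   \dot{x} = f(x) + g(x)u, \qquad \mathcal{C} = \{x : h(x) \ge 0\}.
% Deterministic CBF condition ensuring forward invariance of \mathcal{C},
% with \alpha an extended class-K function:
\begin{equation*}
  \sup_{u \in \mathcal{U}} \big[\, L_f h(x) + L_g h(x)\,u + \alpha(h(x)) \,\big] \;\ge\; 0 .
\end{equation*}
% With learned dynamics (\hat f, \hat g) and a model-error bound
% \epsilon_\delta that dominates the unknown residual with probability
% at least 1-\delta, the constraint is tightened so that safety still
% holds despite the estimation error:
\begin{equation*}
  L_{\hat f} h(x) + L_{\hat g} h(x)\,u \;-\; \|\nabla h(x)\|\,\epsilon_\delta \;+\; \alpha(h(x)) \;\ge\; 0 .
\end{equation*}
Imposing the tightened inequality as a constraint on the control input at each step is the standard way such high-probability safety certificates are enforced while the model is being learned.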