Safety-Critical Modular Deep Reinforcement Learning with Temporal Logic through Gaussian Processes and Control Barrier Functions

Reinforcement learning (RL) is a promising approach, but its success in real-world applications is limited, because ensuring safe exploration while facilitating adequate exploitation is challenging when controlling robotic systems with unknown models and measurement uncertainties. The learning problem becomes even more difficult for complex tasks over continuous state and action spaces. In this paper, we propose a learning-based control framework with four components: (1) we leverage Linear Temporal Logic (LTL) to express complex tasks over infinite horizons, which are translated into a novel automaton structure; (2) we propose an innovative reward scheme for RL agents with the formal guarantee that globally optimal policies maximize the probability of satisfying the LTL specifications; (3) based on a reward-shaping technique, we develop a modular policy-gradient architecture that exploits the automaton structure to decompose overall tasks and enhance the performance of learned controllers; (4) by incorporating Gaussian Processes (GPs) to estimate the uncertain system dynamics, we synthesize a model-based safeguard using Exponential Control Barrier Functions (ECBFs) for systems with high relative degree. In addition, we utilize the properties of LTL automata and ECBFs to develop a guiding process that further improves exploration efficiency. Finally, we demonstrate the effectiveness of the framework in several robotic environments, showing that the ECBF-based modular deep RL algorithm achieves near-perfect success rates and maintains safety with high probability during training.
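
To make the ECBF safeguard in item (4) concrete, the following is a minimal sketch under simplifying assumptions that are not taken from the paper: a double integrator with a known model (so no GP estimate is involved), a single position limit `p_max`, and hand-picked gains `k0`, `k1` that place both poles of s^2 + k1 s + k0 at -2. The barrier h(x) = p_max - p has relative degree 2, so the ECBF condition h'' + k1 h' + k0 h >= 0 becomes an upper bound on the input, and the safety filter simply clips the nominal action to that bound. The function names `ecbf_filter` and `simulate` are hypothetical.

```python
def ecbf_filter(u_nom, p, v, p_max=1.0, k0=4.0, k1=4.0):
    """Clip a nominal input so the ECBF condition holds (assumed double integrator).

    Dynamics (assumption): p' = v, v' = u.
    Barrier: h = p_max - p, so h' = -v and h'' = -u (relative degree 2).
    """
    h = p_max - p
    h_dot = -v
    # ECBF condition: -u + k1*h_dot + k0*h >= 0  <=>  u <= k1*h_dot + k0*h
    u_max = k1 * h_dot + k0 * h
    # With one scalar constraint, the minimally invasive correction
    # (closest input to u_nom) is a simple clip.
    return min(u_nom, u_max)


def simulate(use_filter, steps=2000, dt=0.01):
    """Forward-Euler rollout with a nominal input that always pushes toward the limit."""
    p, v = 0.0, 0.0
    max_p = p
    for _ in range(steps):
        u = 1.0  # stand-in for an exploring RL action
        if use_filter:
            u = ecbf_filter(u, p, v)
        p, v = p + dt * v, v + dt * u
        max_p = max(max_p, p)
    return max_p  # largest position reached during the rollout


if __name__ == "__main__":
    print("max position with filter:   ", simulate(True))
    print("max position without filter:", simulate(False))
```

In this toy setting the filtered rollout stays below `p_max` while the unfiltered one does not. In the framework described above, the known dynamics would be replaced by a GP estimate, and the guarantee would hold only with high probability.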
