Safety is a critical component of autonomous systems and remains a challenge for deploying learning-based policies in the real world. In particular, policies learned with reinforcement learning often fail to generalize to novel environments due to unsafe behavior. In this paper, we propose Sim-to-Lab-to-Real to bridge the reality gap with a probabilistically guaranteed, safety-aware policy distribution. To improve safety, we adopt a dual-policy setup in which a performance policy is trained using the cumulative task reward and a backup (safety) policy is trained by solving the Safety Bellman Equation based on Hamilton-Jacobi (HJ) reachability analysis. In Sim-to-Lab transfer, we apply a supervisory control scheme to shield unsafe actions during exploration; in Lab-to-Real transfer, we leverage the Probably Approximately Correct (PAC)-Bayes framework to provide lower bounds on the expected performance and safety of policies in unseen environments. Additionally, inheriting from the HJ reachability analysis, the bound accounts for the expectation over the worst-case safety in each environment. We empirically study the proposed framework for ego-vision navigation in two types of indoor environments with varying degrees of photorealism. We also demonstrate strong generalization performance through hardware experiments in real indoor spaces with a quadrupedal robot. See https://sites.google.com/princeton.edu/sim-to-lab-to-real for supplementary material.
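For readers unfamiliar with the Safety Bellman Equation mentioned above, a discounted form commonly used for reachability-based safety critics is sketched below; the margin function $\ell$ and the sign convention (positive on safe states) are assumptions of this illustration rather than notation fixed by the abstract:

$$
V(s) \;=\; (1-\gamma)\,\ell(s) \;+\; \gamma \min\Bigl\{ \ell(s),\; \max_{a \in \mathcal{A}} V\bigl(f(s,a)\bigr) \Bigr\},
$$

where $\ell(s)$ measures the distance to the failure set (negative inside it), $f$ denotes the dynamics, and the safe set is recovered as $\{s : V(s) > 0\}$. As $\gamma \to 1$, the fixed point approaches the undiscounted HJ reachability value.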
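The supervisory control ("shielding") scheme used in Sim-to-Lab can be summarized as: execute the performance policy's action unless the learned safety critic flags it, in which case the backup policy takes over. Below is a minimal sketch under an assumed sign convention (critic positive when the backup policy can still recover); the names `perf_policy`, `backup_policy`, `safety_critic`, and the threshold `eps` are hypothetical, not the paper's API.

```python
def shielded_action(obs, perf_policy, backup_policy, safety_critic, eps=0.0):
    """Supervisory control: shield unsafe actions during exploration.

    Assumed sign convention: safety_critic(obs, a) > eps means the
    backup (safety) policy can still keep the system safe after `a`.
    """
    a_perf = perf_policy(obs)
    if safety_critic(obs, a_perf) > eps:
        return a_perf             # performance action passes the safety check
    return backup_policy(obs)     # otherwise defer to the backup policy
```

Raising `eps` makes shielding more conservative: the backup policy intervenes earlier, trading task reward for safety during exploration.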
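The Lab-to-Real guarantee rests on a PAC-Bayes generalization bound over a distribution of policies. As one representative (McAllester-style) form, assuming a bounded cost $C \in [0,1]$ (e.g., one minus a combined success/safety indicator) and $N$ training environments drawn i.i.d. from $\mathcal{D}$, with probability at least $1-\delta$:

$$
\mathbb{E}_{E \sim \mathcal{D}}\, \mathbb{E}_{\pi \sim P}\bigl[C(\pi; E)\bigr]
\;\le\;
\frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{\pi \sim P}\bigl[C(\pi; E_i)\bigr]
\;+\;
\sqrt{\frac{\mathrm{KL}(P \,\|\, P_0) + \ln\frac{2\sqrt{N}}{\delta}}{2N}},
$$

where $P$ is the learned posterior over policies and $P_0$ is a prior fixed before Lab training. An upper bound on expected cost is equivalently a lower bound on expected performance and safety; taking $C$ to encode the worst-case safety outcome within each environment yields the expectation-over-worst-case guarantee described above.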