Safety is a critical component of autonomous systems and remains a challenge for deploying learning-based policies in the real world. In particular, policies learned with reinforcement learning often fail to generalize to novel environments because of unsafe behavior. In this paper, we propose Sim-to-Lab-to-Real to bridge the reality gap with a probabilistically guaranteed, safety-aware policy distribution. To improve safety, we apply a dual-policy setup in which a performance policy is trained using the cumulative task reward and a backup (safety) policy is trained by solving the Safety Bellman Equation based on Hamilton-Jacobi (HJ) reachability analysis. In Sim-to-Lab transfer, we apply a supervisory control scheme to shield unsafe actions during exploration; in Lab-to-Real transfer, we leverage the Probably Approximately Correct (PAC)-Bayes framework to provide lower bounds on the expected performance and safety of policies in unseen environments. Moreover, inheriting the worst-case analysis of HJ reachability, the bound accounts for the expectation over the worst-case safety in each environment. We empirically study the proposed framework for ego-vision navigation in two types of indoor environments with varying degrees of photorealism, and we demonstrate strong generalization through hardware experiments with a quadrupedal robot in real indoor spaces. See https://sites.google.com/princeton.edu/sim-to-lab-to-real for supplementary material.
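As an illustration of the dual-policy shielding described above, the sketch below is a minimal Python rendition. The function names and interfaces (perf_policy, backup_policy, safety_critic) and the sign convention (safety value positive when safe) are assumptions for illustration, not the paper's exact implementation. The safety critic is regressed toward a discounted Safety Bellman backup from HJ reachability, and the shield overrides the performance policy whenever the proposed action is predicted to lead toward the failure set.

```python
# Minimal, illustrative sketch of the dual-policy shielding scheme and the
# discounted Safety Bellman backup. All names and the sign convention
# (value > 0 means safe) are assumptions, not the paper's exact interfaces.

def safety_bellman_target(margin, max_next_value, gamma):
    """Discounted Safety Bellman backup from HJ reachability RL:
        V(s) = (1 - gamma) * l(s) + gamma * min(l(s), max_a V(s')),
    where l(s) is the signed safety margin (positive when safe)."""
    return (1.0 - gamma) * margin + gamma * min(margin, max_next_value)

def shielded_action(obs, perf_policy, backup_policy, safety_critic,
                    threshold=0.0):
    """Supervisory control (shielding): execute the task-reward policy's
    action unless the safety critic predicts it leads toward failure, in
    which case fall back to the backup (safety) policy."""
    action = perf_policy(obs)
    if safety_critic(obs, action) < threshold:  # predicted unsafe
        action = backup_policy(obs)  # action maximizing the safety value
    return action

if __name__ == "__main__":
    # Toy demo with stand-in policies/critic on a 1-D observation.
    perf = lambda obs: 1.0           # always move forward
    backup = lambda obs: -1.0        # retreat from the obstacle
    critic = lambda obs, a: obs - a  # toy critic: margin shrinks with a
    print(shielded_action(0.5, perf, backup, critic))  # -> -1.0 (shielded)
```

For the Lab-to-Real step, a standard PAC-Bayes inequality of the Maurer/McAllester type bounds, with probability at least 1 - \delta over the draw of N training environments, the expected cost C_D(P) of the policy distribution P on unseen environments by C_S(P) + \sqrt{(KL(P \| P_0) + \log(2\sqrt{N}/\delta)) / (2N)}; one minus this quantity lower-bounds the expected performance/safety. The paper's exact bound may use a tighter variant (e.g., a KL-inverse form).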