Safety is a crucial concern in many applications of reinforcement learning (RL), whether robotic, automotive, or medical. Many existing approaches to safe RL rely on receiving numeric safety feedback, but in many cases this feedback can only take binary values; that is, whether an action in a given state is safe or unsafe. This is particularly true when feedback comes from human experts. We therefore consider the problem of provably safe RL when given access to an offline oracle providing binary feedback on the safety of state-action pairs. We provide a novel meta-algorithm, SABRE, which can be applied to any MDP setting given access to a black-box PAC RL algorithm for that setting. SABRE applies concepts from active learning to reinforcement learning to provably control the number of queries to the safety oracle. SABRE works by iteratively exploring the state space to find regions where the agent is currently uncertain about safety. Our main theoretical result shows that, under appropriate technical assumptions, SABRE never takes unsafe actions during training, and is guaranteed to return a near-optimal safe policy with high probability. We provide a discussion of how our meta-algorithm may be applied to various settings studied in both the theoretical and empirical literature.
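To make the high-level description above concrete, the following is a minimal sketch of how a SABRE-style loop might be organized. All names and interfaces here (pac_rl as the black-box PAC RL routine, safety_oracle as the offline binary oracle, the batch size, the assumption of finite state-action spaces) are illustrative assumptions for exposition, not the paper's actual pseudocode or guarantees.

```python
# Hypothetical sketch of a SABRE-style meta-algorithm loop (illustrative only).
from typing import Callable, Set, Tuple

StateAction = Tuple[int, int]  # (state, action); finite spaces assumed for simplicity


def sabre_sketch(
    pac_rl: Callable[[Set[StateAction], str], object],  # black-box PAC RL routine (assumed interface)
    safety_oracle: Callable[[StateAction], bool],        # offline binary safety oracle
    candidates: Set[StateAction],                        # state-action pairs under consideration
    num_rounds: int = 10,
    batch_size: int = 10,
) -> object:
    known_safe: Set[StateAction] = set()
    known_unsafe: Set[StateAction] = set()

    for _ in range(num_rounds):
        uncertain = candidates - known_safe - known_unsafe
        if not uncertain:
            break
        # Explore within the currently certified-safe region, with an exploration
        # objective that steers the agent toward reachable uncertain pairs.
        pac_rl(known_safe, "explore-uncertainty")
        # Query the offline oracle on a batch of uncertain pairs; active-learning-style
        # selection is what keeps the total number of oracle queries controlled.
        for sa in sorted(uncertain)[:batch_size]:
            (known_safe if safety_oracle(sa) else known_unsafe).add(sa)

    # Final call: optimize the true reward restricted to the certified-safe region.
    return pac_rl(known_safe, "maximize-reward")
```

The key design point illustrated by the sketch is that exploration and oracle querying are interleaved: the agent only ever acts within the region already labeled safe, and the oracle is consulted only where safety is still uncertain.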