Safety remains a central obstacle preventing widespread use of RL in the real world: learning new tasks in uncertain environments requires extensive exploration, but safety requires limiting exploration. We propose Recovery RL, an algorithm which navigates this tradeoff by (1) leveraging offline data to learn about constraint violating zones before policy learning and (2) separating the goals of improving task performance and constraint satisfaction across two policies: a task policy that only optimizes the task reward and a recovery policy that guides the agent to safety when constraint violation is likely. We evaluate Recovery RL on 6 simulation domains, including two contact-rich manipulation tasks and an image-based navigation task, and an image-based obstacle avoidance task on a physical robot. We compare Recovery RL to 5 prior safe RL methods which jointly optimize for task performance and safety via constrained optimization or reward shaping and find that Recovery RL outperforms the next best prior method across all domains. Results suggest that Recovery RL trades off constraint violations and task successes 2-20 times more efficiently in simulation domains and 3 times more efficiently in physical experiments. See https://tinyurl.com/rl-recovery for videos and supplementary material.
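To make the two-policy composition concrete, the sketch below shows one way the execution-time switching rule could look. It is a minimal illustration, not the authors' implementation: the abstract only states that a recovery policy takes over when constraint violation is likely, so the safety critic q_risk, the threshold eps_risk, and all function names here are assumed for illustration.

```python
# Hypothetical sketch of Recovery RL-style action selection.
# Assumptions (not from the abstract): a learned safety critic q_risk(state, action)
# estimating the probability of future constraint violation, and a fixed risk
# threshold eps_risk above which control is handed to the recovery policy.

EPS_RISK = 0.3  # illustrative risk threshold


def select_action(state, task_policy, recovery_policy, q_risk, eps_risk=EPS_RISK):
    """Return the action to execute in the environment.

    task_policy(state)     -> action optimizing only the task reward
    recovery_policy(state) -> action steering the agent back toward safety
    q_risk(state, action)  -> estimated likelihood of future constraint violation
    """
    a_task = task_policy(state)
    if q_risk(state, a_task) > eps_risk:
        # Proposed task action is deemed too risky: the recovery policy acts instead.
        return recovery_policy(state)
    # Otherwise the task policy's action is considered safe to execute.
    return a_task
```

Separating the two policies this way lets the task policy optimize reward without a safety penalty in its objective, while the recovery policy (pretrained on offline data about constraint-violating zones) intervenes only when needed.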