Reinforcement learning (RL) algorithms have shown impressive success in exploring high-dimensional environments to learn complex, long-horizon tasks, but can often exhibit unsafe behaviors and require extensive environment interaction when exploration is unconstrained. A promising strategy for safe learning in dynamically uncertain environments is to require that the agent can robustly return to states where task success (and therefore safety) can be guaranteed. While this approach has been successful in low-dimensional settings, enforcing this constraint in environments with high-dimensional state spaces, such as images, is challenging. We present Latent Space Safe Sets (LS3), which extends this strategy to iterative, long-horizon tasks with image observations by using suboptimal demonstrations and a learned dynamics model to restrict exploration to the neighborhood of a learned Safe Set where task completion is likely. We evaluate LS3 on four domains, including a challenging sequential pushing task in simulation and a physical cable routing task. We find that LS3 can use prior task successes to restrict exploration and learn more efficiently than prior algorithms while satisfying constraints. See https://tinyurl.com/latent-ss for code and supplementary material.
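As a rough illustration of the exploration constraint described above, the sketch below shows a sampling-based model-predictive planning loop that rejects candidate action sequences whose predicted terminal latent state is unlikely to lie in the learned Safe Set. This is a minimal, hypothetical sketch, not the paper's implementation: `encoder`, `dynamics`, `safe_set_prob`, `value_fn`, and all thresholds are assumed stand-ins for the learned components.

```python
import numpy as np

def plan_action(encoder, dynamics, safe_set_prob, value_fn, obs,
                horizon=5, n_samples=1000, action_dim=2, safe_thresh=0.8):
    """Sample action sequences, roll them out under a learned latent
    dynamics model, and keep only plans whose terminal latent state is
    predicted to return to the safe set. All callables are hypothetical
    stand-ins for learned models, not the authors' actual interfaces."""
    z0 = encoder(obs)  # encode image observation into latent state
    best_val, best_action = -np.inf, None
    for _ in range(n_samples):
        # Sample a candidate open-loop action sequence.
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        z = z0
        for a in actions:
            z = dynamics(z, a)  # roll out learned latent dynamics
        # Reject plans unlikely to end in the learned Safe Set.
        if safe_set_prob(z) < safe_thresh:
            continue
        val = value_fn(z)  # estimated task progress at the terminal state
        if val > best_val:
            best_val, best_action = val, actions[0]
    return best_action  # execute the first action, then replan (MPC)
```

In practice a method like this would replan at every timestep, so only the first action of the best surviving sequence is executed; the safe-set check is what restricts exploration to states from which prior successes suggest the task can still be completed.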