Safe exploration is critical for using reinforcement learning (RL) in risk-sensitive environments. Recent work learns risk measures, which estimate the probability of violating constraints, and uses them to enable safe exploration. However, learning such risk measures requires significant interaction with the environment, resulting in excessive constraint violations during learning. Furthermore, these measures are not easily transferable to new environments. We cast safe exploration as an offline meta-RL problem, where the objective is to leverage examples of safe and unsafe behavior across a range of environments to quickly adapt learned risk measures to a new environment with previously unseen dynamics. We then propose MEta-learning for Safe Adaptation (MESA), an approach for meta-learning a risk measure for safe RL. Simulation experiments across 5 continuous control domains suggest that MESA can leverage offline data from a range of different environments to reduce constraint violations in unseen environments by up to a factor of 2 while maintaining task performance. See https://tinyurl.com/safe-meta-rl for code and supplementary material.
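To make the idea concrete, below is a minimal sketch (not the authors' released code) of one way to meta-learn a risk measure: a safety critic that maps a state-action pair to an estimated probability of constraint violation, meta-trained across offline datasets from several training environments and then adapted with a few gradient steps on a small offline batch from a new environment. The Reptile-style first-order outer loop, the network sizes, the loss, and all names and hyperparameters here are illustrative assumptions, not the method's specified implementation.

```python
# Sketch: meta-learning a risk measure (safety critic) from offline data.
# Assumptions: binary constraint-violation labels per transition, a
# Reptile-style first-order meta-update, and synthetic placeholder data.
import copy
import torch
import torch.nn as nn

class SafetyCritic(nn.Module):
    """MLP mapping (state, action) -> probability of constraint violation."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return torch.sigmoid(self.net(torch.cat([s, a], dim=-1)))

def inner_update(critic, batch, steps=5, lr=1e-3):
    """Fine-tune a copy of the critic on one environment's offline data."""
    model = copy.deepcopy(critic)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    s, a, violated = batch  # `violated` is a {0,1} violation label
    for _ in range(steps):
        loss = nn.functional.binary_cross_entropy(model(s, a), violated)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

def meta_train(critic, env_datasets, meta_steps=100, meta_lr=0.1):
    """Reptile-style outer loop: move meta-parameters toward adapted ones."""
    for _ in range(meta_steps):
        for batch in env_datasets:  # one offline dataset per training env
            adapted = inner_update(critic, batch)
            with torch.no_grad():
                for p, q in zip(critic.parameters(), adapted.parameters()):
                    p += meta_lr * (q - p)

if __name__ == "__main__":
    # Toy usage with synthetic offline data standing in for real datasets.
    torch.manual_seed(0)
    state_dim, action_dim, n = 4, 2, 256

    def fake_dataset():
        s, a = torch.randn(n, state_dim), torch.randn(n, action_dim)
        violated = (s[:, :1] + a[:, :1] > 1.0).float()  # synthetic label
        return s, a, violated

    critic = SafetyCritic(state_dim, action_dim)
    meta_train(critic, [fake_dataset() for _ in range(3)], meta_steps=20)
    # Adaptation: a few gradient steps on a small batch from the target env.
    target_critic = inner_update(critic, fake_dataset(), steps=10)
```

In this sketch, adaptation to an unseen environment requires only a small labeled offline batch rather than fresh online interaction, which is the mechanism by which a meta-learned risk measure could cut constraint violations during learning in the new environment.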