While safe reinforcement learning (RL) holds great promise for many practical applications like robotics or autonomous cars, current approaches require specifying constraints in mathematical form. Such specifications demand domain expertise, limiting the adoption of safe RL. In this paper, we propose learning to interpret natural language constraints for safe RL. To this end, we first introduce HazardWorld, a new multi-task benchmark that requires an agent to optimize reward while not violating constraints specified in free-form text. We then develop an agent with a modular architecture that can interpret and adhere to such textual constraints while learning new tasks. Our model consists of (1) a constraint interpreter that encodes textual constraints into spatial and temporal representations of forbidden states, and (2) a policy network that uses these representations to produce a policy with minimal constraint violations during training. Across different domains in HazardWorld, we show that our method achieves higher rewards (up to 11x) and fewer constraint violations (by 1.8x) compared to existing approaches. However, in terms of absolute performance, HazardWorld still poses significant challenges for agents to learn efficiently, motivating the need for future work.
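To make the modular design described above concrete, the following is a minimal sketch of how a constraint interpreter and a constraint-conditioned policy network could be wired together. The module names, layer choices (GRU text encoder, MLP policy trunk), and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ConstraintInterpreter(nn.Module):
    """Hypothetical sketch: encodes a tokenized textual constraint into
    (a) a spatial mask over forbidden grid cells and (b) a temporal/budget
    embedding (e.g. "do not enter lava more than twice")."""
    def __init__(self, vocab_size, embed_dim=64, grid_cells=49):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.spatial_head = nn.Linear(embed_dim, grid_cells)   # forbidden-state mask logits
        self.temporal_head = nn.Linear(embed_dim, embed_dim)   # budget / timing representation

    def forward(self, constraint_tokens):
        h, _ = self.encoder(self.embed(constraint_tokens))
        summary = h[:, -1]                                      # last hidden state summarizes the text
        spatial = torch.sigmoid(self.spatial_head(summary))     # per-cell "forbidden" probability
        temporal = self.temporal_head(summary)
        return spatial, temporal

class PolicyNetwork(nn.Module):
    """Hypothetical sketch: produces action logits from the observation
    concatenated with the constraint representations, so the policy can
    avoid the states the interpreter marks as forbidden."""
    def __init__(self, obs_dim, grid_cells=49, embed_dim=64, n_actions=7):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + grid_cells + embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs, spatial, temporal):
        return self.trunk(torch.cat([obs, spatial, temporal], dim=-1))
```

In this factorization, the interpreter can be trained or reused independently of the task, while the policy is trained with any standard RL algorithm on the augmented input; how the two modules are actually supervised and combined is specified in the paper itself.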