In safety-critical applications, autonomous agents may need to learn in an environment where mistakes can be very costly. In such settings, the agent needs to behave safely not only after but also while learning. To achieve this, existing safe reinforcement learning methods make an agent rely on priors that let it avoid dangerous situations during exploration with high probability, but both the probabilistic guarantees and the smoothness assumptions inherent in these priors are not viable in many scenarios of interest, such as autonomous driving. This paper presents an alternative approach inspired by human teaching, where an agent learns under the supervision of an automatic instructor that keeps it from violating constraints during learning. In this model, we introduce a teacher that needs to know neither how to do well at the task the agent is learning nor how the environment works. Instead, it has a library of reset controllers that it activates when the agent starts behaving dangerously, preventing it from doing damage. Crucially, the choice of which reset controller to apply in which situation affects the speed of the agent's learning. Based on observing the agent's progress, the teacher itself learns a policy for choosing the reset controllers, a curriculum, that optimizes the agent's final policy reward. Our experiments use this framework in two environments to induce curricula for safe and efficient learning.
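The following is a minimal sketch of the interaction loop described above: a monitor checks for dangerous behavior, a teacher picks a reset controller from its library, and the teacher updates its choice based on the agent's learning progress. All names (`ResetController`, `Teacher`, `is_dangerous`, the gym-style `env`/`agent` interfaces) and the simple bandit-style teacher update are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of teacher-supervised learning with reset controllers.
# The environment/agent interfaces and the bandit-style teacher are assumptions.
import random


class ResetController:
    """One entry in the teacher's library: drives the agent back to safety."""
    def __init__(self, name, reset_fn):
        self.name = name
        self.reset_fn = reset_fn  # maps a dangerous state to a safe state


class Teacher:
    """Chooses which reset controller to apply; learns from the agent's progress."""
    def __init__(self, controllers, epsilon=0.1):
        self.controllers = controllers
        self.value = {c.name: 0.0 for c in controllers}  # running progress estimate
        self.count = {c.name: 0 for c in controllers}
        self.epsilon = epsilon

    def choose(self):
        # Epsilon-greedy choice over reset controllers (the "curriculum" decision).
        if random.random() < self.epsilon:
            return random.choice(self.controllers)
        return max(self.controllers, key=lambda c: self.value[c.name])

    def update(self, controller, progress):
        # progress: e.g. change in the agent's return since the last intervention.
        self.count[controller.name] += 1
        n = self.count[controller.name]
        self.value[controller.name] += (progress - self.value[controller.name]) / n


def train_under_supervision(env, agent, teacher, is_dangerous, num_steps):
    """Run the agent; the teacher intervenes before constraints are violated."""
    state = env.reset()
    last_return, episode_return = 0.0, 0.0
    for _ in range(num_steps):
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        agent.learn(state, action, reward, next_state, done)
        episode_return += reward
        if is_dangerous(next_state):
            controller = teacher.choose()                   # pick a reset controller
            next_state = controller.reset_fn(next_state)    # save the agent from damage
            teacher.update(controller, episode_return - last_return)
            last_return, episode_return = episode_return, 0.0
        state = env.reset() if done else next_state
    return agent
```

In this sketch the teacher never needs a model of the environment or of the task: it only observes a scalar progress signal after each intervention and uses it to prefer the reset controllers that speed up the agent's learning.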