Exploring an unknown system can place an agent in dangerous situations, exposing it to potentially catastrophic hazards. Many current approaches to safe learning in reinforcement learning (RL) impose a trade-off between safe exploration and fulfilling the task. While these methods may incur fewer safety violations, they often also reduce task performance. In this paper, we take a first step towards a generation of RL solvers that learn to minimise safety violations while maximising the task reward to the extent that can be tolerated by safe policies. Our approach uses a new two-player framework for safe RL called Distributive Exploration Safety Training Algorithm (DESTA). The core of DESTA is a novel game between two RL agents: SAFETY AGENT, which is delegated the task of minimising safety violations, and TASK AGENT, whose goal is to maximise the reward set by the environment task. SAFETY AGENT can selectively take control of the system at any given point to prevent safety violations, while TASK AGENT is free to execute its actions at all other states. This framework enables SAFETY AGENT to learn to take actions that minimise future safety violations (during and after training) by performing safe actions at certain states, while TASK AGENT performs actions that maximise task performance everywhere else. We demonstrate DESTA's ability to tackle challenging tasks and compare against state-of-the-art RL methods on Safety Gym benchmarks, which simulate real-world physical systems, and OpenAI's Lunar Lander.
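The control-switching idea at the heart of the framework can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`desta_step`, `intervene_fn`) and the scalar toy state are assumptions made for clarity, and in the actual method the intervention rule is itself learned by SAFETY AGENT rather than hand-specified.

```python
import numpy as np

def desta_step(state, task_policy, safety_policy, intervene_fn):
    """One environment step under a two-agent control-switching scheme:
    the safety agent decides whether to take control at this state;
    otherwise the task agent's action is executed."""
    if intervene_fn(state):            # safety agent claims control
        return safety_policy(state)    # action aimed at avoiding violations
    return task_policy(state)          # action aimed at maximising reward

# Toy illustration on a 1-D state: intervene when the state nears a
# hazard region (|x| > 0.8); these thresholds are purely for illustration.
task_policy = lambda s: 1.0            # task agent always pushes forward
safety_policy = lambda s: -np.sign(s)  # safety agent steers back to origin
intervene = lambda s: abs(s) > 0.8

print(desta_step(0.2, task_policy, safety_policy, intervene))  # task agent acts
print(desta_step(0.9, task_policy, safety_policy, intervene))  # safety agent acts
```

Here the task agent acts freely in safe regions, and the safety agent overrides it only near the hazard, mirroring the selective take-over described above.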