During training, reinforcement learning systems interact with the world without considering the safety of their actions. When deployed in the real world, such systems can be dangerous and cause harm to their surroundings. Often, dangerous situations can be mitigated by defining a set of rules that the system must not violate under any conditions. For example, in robot navigation, one safety rule would be to avoid colliding with surrounding objects and people. In this work, we define safety rules in terms of the relationships between the agent and objects, and use them to prevent reinforcement learning systems from performing potentially harmful actions. We propose a new safe epsilon-greedy algorithm that uses these safety rules to override an agent's action whenever it is judged unsafe. In our experiments, we show that the safe epsilon-greedy policy significantly increases the safety of the agent during training, improves learning efficiency with much faster convergence, and achieves better performance than the base model.
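To illustrate the idea of rule-based overriding, the following is a minimal sketch of safe epsilon-greedy action selection, not the authors' implementation; the helpers `is_safe` and `safe_fallback`, and the choice to fall back to the best-valued safe action, are assumptions introduced here for illustration.

```python
import random

def safe_epsilon_greedy_action(q_values, state, epsilon, is_safe, safe_fallback):
    """Pick an action epsilon-greedily, then override it if it violates a safety rule.

    q_values: mapping from action -> estimated value in the current state.
    is_safe(state, action): hypothetical check returning True if the action
        respects all safety rules (e.g. does not lead to a collision).
    safe_fallback(state): hypothetical known-safe action (e.g. stop / no-op).
    """
    actions = list(q_values.keys())

    # Standard epsilon-greedy proposal: explore with probability epsilon,
    # otherwise exploit the current value estimates.
    if random.random() < epsilon:
        proposed = random.choice(actions)
    else:
        proposed = max(actions, key=lambda a: q_values[a])

    # Safety override: keep the proposed action only if it satisfies the rules.
    if is_safe(state, proposed):
        return proposed

    # Otherwise replace it with the best-valued action that is still safe,
    # or with a safe fallback if no such action exists.
    safe_actions = [a for a in actions if is_safe(state, a)]
    if safe_actions:
        return max(safe_actions, key=lambda a: q_values[a])
    return safe_fallback(state)
```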