Safety remains one of the major research challenges in reinforcement learning (RL). In this paper, we address the problem of avoiding safety violations of RL agents during exploration in probabilistic and partially unknown environments. Our approach combines automata learning for Markov Decision Processes (MDPs) and shield synthesis into an iterative procedure. Initially, the MDP representing the environment is unknown. The agent starts exploring the environment and collects traces. From the collected traces, we passively learn MDPs that abstractly represent the safety-relevant aspects of the environment. Given a learned MDP and a safety specification, we construct a shield. For each state-action pair of the learned MDP, the shield computes the exact probability that executing the action leads to a violation of the specification from the current state within the next $k$ steps. After the shield is constructed, it is used at runtime to block from the agent any action that induces too large a risk. The shielded agent continues to explore the environment and collects new data about it. Iteratively, we use the collected data to learn new MDPs with higher accuracy, which in turn yield shields that prevent more safety violations. We implemented our approach and present a detailed case study of a Q-learning agent exploring slippery Gridworlds. Our experiments show that as the agent explores more and more of the environment during training, the improved learned models lead to shields that are able to prevent many safety violations.
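To make the shielding step concrete, the following is a minimal sketch, not the paper's implementation: given a learned MDP, it computes for each state-action pair the probability of reaching an unsafe state within the next $k$ steps, and a shield then blocks actions whose risk exceeds a threshold. All identifiers (`transitions`, `unsafe`, `k`, `delta`) are illustrative assumptions, and the recursion assumes the agent behaves optimally safely after the current step (minimum over future actions), which is one common shield construction; the paper's exact risk definition may differ.

```python
def k_step_risk(transitions, unsafe, k):
    """Sketch of bounded-horizon risk computation on a learned MDP.

    transitions[s][a] is a list of (successor, probability) pairs; every
    reachable state is assumed to appear as a key of `transitions`.
    Returns risk[(s, a)], the probability of visiting an unsafe state
    within the next k steps when executing action a in state s.
    """
    # v[s]: probability of violating safety from s within the remaining horizon
    v = {s: 1.0 if s in unsafe else 0.0 for s in transitions}
    risk = {}
    for _ in range(k):
        # Risk of each action for the current horizon.
        risk = {
            (s, a): 1.0 if s in unsafe else sum(p * v[t] for t, p in succ)
            for s in transitions
            for a, succ in transitions[s].items()
        }
        # Assume optimally safe behavior afterwards: take the safest action.
        v = {
            s: 1.0 if s in unsafe
            else min((risk[(s, a)] for a in transitions[s]), default=0.0)
            for s in transitions
        }
    return risk


def shielded_actions(state, actions, risk, delta=0.2):
    """Block actions whose k-step violation probability exceeds delta."""
    allowed = [a for a in actions if risk.get((state, a), 1.0) <= delta]
    return allowed or actions  # fall back to all actions if none is safe enough
```

The threshold `delta` trades off safety against exploration: a smaller value blocks more actions and prevents more violations, but may also keep the agent from visiting parts of the environment it still needs to learn.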