Despite the recent impressive results in reinforcement learning (RL), safety remains one of the major research challenges in RL. RL is a machine-learning approach for computing near-optimal policies in Markov decision processes (MDPs). In this paper, we consider the setting in which the safety-relevant fragment of the MDP, together with a temporal logic safety specification, is given, and many safety violations can be avoided by planning ahead for a short time into the future. We propose an approach for online safety shielding of RL agents. At runtime, the shield analyses the safety of each available action: for every action, it computes the maximal probability of not violating the safety specification within the next $k$ steps when executing that action. Based on this probability and a given threshold, the shield decides whether to block the action from the agent. Existing offline shielding approaches exhaustively compute the safety of all state-action combinations ahead of time, resulting in huge computation times and high memory consumption. The intuition behind online shielding is to compute, at runtime, the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed and used for shielding as soon as one of the considered states is reached. Our approach is well suited for high-level planning problems, where the time between decisions can be used for safety computations and it is sustainable for the agent to wait until these computations have finished. For our evaluation, we selected a two-player version of the classical computer game SNAKE. The game represents a high-level planning problem that requires fast decisions, and the multiplayer setting induces a large state space, which is computationally expensive to analyse exhaustively.
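To make the shield's decision rule concrete, the following is a minimal Python sketch of threshold-based action blocking, assuming a function `safety_prob(state, action)` that returns the maximal probability of not violating the safety specification within the next $k$ steps (e.g. obtained from a probabilistic model checker on the safety-relevant MDP fragment). The names `shield_actions` and `safety_prob`, as well as the fallback to the safest available action, are illustrative assumptions and not the paper's exact interface.

```python
from typing import Callable, Hashable, List

State = Hashable
Action = Hashable

def shield_actions(
    state: State,
    actions: List[Action],
    safety_prob: Callable[[State, Action], float],
    threshold: float,
) -> List[Action]:
    """Return the subset of actions the shield allows in `state`.

    An action is blocked if its probability of staying safe for the
    next k steps, as reported by `safety_prob`, falls below `threshold`.
    """
    values = {a: safety_prob(state, a) for a in actions}
    allowed = [a for a, p in values.items() if p >= threshold]
    # If no action meets the threshold, fall back to the safest
    # available action so the agent is never left without a choice
    # (an assumption of this sketch, not prescribed by the abstract).
    if not allowed:
        allowed = [max(values, key=values.get)]
    return allowed
```

In the online setting described above, `safety_prob` would be precomputed during the time between decisions for every state reachable in the near future, so that the lookup at decision time is immediate.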