It is challenging to use reinforcement learning (RL) in cyber-physical systems due to the lack of safety guarantees during learning. Although various techniques have been proposed to reduce undesired behaviors during learning, most of them require prior system knowledge, which limits their applicability. This paper aims to reduce undesired behaviors during learning without requiring any prior system knowledge. We propose dynamic shielding: an extension of a model-based safe RL technique called shielding that uses automata learning. The dynamic shielding technique constructs an approximate system model in parallel with RL using a variant of the RPNI algorithm and suppresses undesired exploration using a shield constructed from the learned model. Through this combination, potentially unsafe actions can be foreseen before the agent experiences them. Experiments show that our dynamic shield significantly decreases the number of undesired events during training.
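The core idea can be illustrated with a minimal sketch (the class and function names below are ours for illustration, not the paper's API): a system model is learned online from observed traces, and a shield consults it to veto actions that the model predicts lead to an unsafe state, falling back to the full action set when no action looks safe.

```python
# Illustrative sketch of dynamic shielding, assuming a simple
# transition-map model; the paper learns a Mealy-machine-style model
# with an RPNI variant, which this deliberately simplifies.

class LearnedModel:
    """Approximate system model learned online from observed traces."""

    def __init__(self):
        self.transitions = {}  # (state, action) -> next state
        self.unsafe = set()    # states in which a safety violation was observed

    def observe(self, state, action, next_state, violated):
        """Record one observed step; mark the successor unsafe if needed."""
        self.transitions[(state, action)] = next_state
        if violated:
            self.unsafe.add(next_state)

    def predicts_unsafe(self, state, action):
        """True iff the model already maps (state, action) to a known-unsafe state."""
        nxt = self.transitions.get((state, action))
        return nxt is not None and nxt in self.unsafe


def shielded_actions(model, state, candidate_actions):
    """Filter out actions the learned model predicts to be unsafe.

    If every candidate looks unsafe, fall back to the full set so the
    agent is never left without a choice.
    """
    safe = [a for a in candidate_actions
            if not model.predicts_unsafe(state, a)]
    return safe or list(candidate_actions)
```

In an RL loop, the agent would pick its action from `shielded_actions(...)` instead of the raw action set; as the learned model improves, the shield vetoes unsafe actions before the agent ever experiences their consequences.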