Reinforcement learning (RL) is a promising optimal control technique for multi-energy management systems. It does not require a model a priori, which reduces the upfront and ongoing project-specific engineering effort, and it is capable of learning better representations of the underlying system dynamics. However, vanilla RL does not provide constraint-satisfaction guarantees, which can result in various potentially unsafe interactions with its safety-critical environment. In this paper, we present two novel safe RL methods, namely SafeFallback and GiveSafe, in which the safety constraint formulation is decoupled from the RL formulation and which provide hard-constraint satisfaction guarantees both during training of a (near-)optimal policy (which involves exploratory and exploitative, i.e. greedy, steps) and during deployment of any policy (e.g. random agents or offline-trained RL agents). In a simulated multi-energy systems case study we have shown that both methods start with a significantly higher utility (i.e. a useful policy) compared to a vanilla RL benchmark (94.6% and 82.8% compared to 35.5%) and that the proposed SafeFallback method can even outperform the vanilla RL benchmark (102.9% vs. 100%). We conclude that both methods are viable safety-constraint handling techniques applicable beyond RL, as demonstrated with random policies that still satisfy the hard constraints. Finally, we propose directions for future work, including improving the constraint functions themselves as more data becomes available.
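To make the decoupling idea concrete, the minimal sketch below wraps an arbitrary policy with an independent constraint check and a predefined safe fallback action, so that no unsafe action is ever executed regardless of how the policy was obtained. The constraint, the fallback, and all interface names (constraint_satisfied, fallback_action, SafeFallbackWrapper) are illustrative assumptions for this sketch, not the paper's actual formulation.

```python
import numpy as np

def constraint_satisfied(state: np.ndarray, action: np.ndarray) -> bool:
    """Hypothetical hard constraint: keep total dispatched power within an assumed limit."""
    max_power = 10.0  # assumed plant limit (kW), for illustration only
    return float(np.sum(np.abs(action))) <= max_power

def fallback_action(state: np.ndarray) -> np.ndarray:
    """Hypothetical always-safe fallback, e.g. dispatch nothing."""
    return np.zeros(2)

class SafeFallbackWrapper:
    """Wraps any policy (random, offline-trained, or learning RL agent) with a safety layer."""

    def __init__(self, policy):
        self.policy = policy  # callable: state -> proposed action

    def act(self, state: np.ndarray) -> np.ndarray:
        proposed = self.policy(state)
        if constraint_satisfied(state, proposed):
            return proposed
        # The safety layer, not the RL agent, enforces the hard constraint.
        return fallback_action(state)

# Usage: the safety layer is policy-agnostic, so even a random agent stays safe.
rng = np.random.default_rng(0)
random_policy = lambda s: rng.uniform(-20.0, 20.0, size=2)
safe_agent = SafeFallbackWrapper(random_policy)
state = np.array([0.5, 1.2])
action = safe_agent.act(state)
assert constraint_satisfied(state, action)
```

Because the constraint check and fallback live outside the policy, this structure reflects why such a layer can guarantee safety during both exploratory training steps and deployment of arbitrary policies.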