Safe reinforcement learning (RL) with hard constraint guarantees is a promising optimal control direction for multi-energy management systems. It requires only the environment-specific constraint functions themselves a priori, and not a complete model (i.e., plant, disturbance, and noise models, as well as prediction models for states not included in the plant model, such as demand, weather, and price forecasts). Project-specific upfront and ongoing engineering effort is therefore reduced, better representations of the underlying system dynamics can still be learned, and modeling bias is kept to a minimum (no model-based objective function). However, even the constraint functions alone are not always trivial to provide accurately in advance (e.g., an energy balance constraint requires a detailed determination of all energy inputs and outputs), which can lead to unsafe behavior. In this paper, we present two novel advancements: (I) combining the OptLayer and SafeFallback methods, named OptLayerPolicy, to increase the initial utility while keeping a high sample efficiency; and (II) introducing self-improving hard constraints, to increase the accuracy of the constraint functions as more data becomes available, so that better policies can be learned. Both advancements keep the constraint formulation decoupled from the RL formulation, so that new (presumably better) RL algorithms can act as drop-in replacements. We have shown that, in a simulated multi-energy system case study, the initial utility is increased to 92.4% (OptLayerPolicy) compared to 86.1% (OptLayer), and that the policy after training reaches 104.9% (GreyOptLayerPolicy) compared to 103.4% (OptLayer), all relative to a vanilla RL benchmark. While introducing surrogate functions into the optimization problem requires special attention, we conclude that the newly presented GreyOptLayerPolicy method is the most advantageous.
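To make the action-projection idea behind OptLayer-style safety layers concrete, the following is a minimal sketch, not the paper's implementation: a proposed RL action is projected onto a safe set described by (possibly approximate) hard constraints, and a known-safe fallback action is used when no feasible projection exists, in the spirit of SafeFallback. The constraint matrices A and b, the helper safe_action, and the fallback action are hypothetical placeholders; the paper's constraints are environment-specific and may be refined from data.

```python
# Illustrative sketch only: project a proposed RL action onto the set {a : A @ a <= b}
# (an assumed linear stand-in for the environment-specific constraint functions),
# and fall back to a known-safe action if the projection problem is infeasible.
import numpy as np
import cvxpy as cp

def safe_action(a_rl: np.ndarray, A: np.ndarray, b: np.ndarray,
                a_fallback: np.ndarray) -> np.ndarray:
    """Return the action closest to a_rl that satisfies A @ a <= b."""
    a = cp.Variable(a_rl.shape[0])
    problem = cp.Problem(cp.Minimize(cp.sum_squares(a - a_rl)), [A @ a <= b])
    problem.solve()
    if problem.status in ("optimal", "optimal_inaccurate"):
        return a.value
    return a_fallback  # fallback when the constraints admit no feasible projection

# Toy 2-D example: actions must stay within the unit box.
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.ones(4)
print(safe_action(np.array([1.5, -0.2]), A, b, a_fallback=np.zeros(2)))
```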