Action Set Based Policy Optimization for Safe Power Grid Management

Maintaining the stability of the modern power grid is becoming increasingly difficult due to fluctuating power consumption, unstable supply from renewable energy sources, and unpredictable accidents such as man-made and natural disasters. Because any operation on the power grid must account for its impact on future stability, reinforcement learning (RL) has been employed to provide sequential decision-making in power grid management. However, existing methods have not considered the environmental constraints. As a result, the learned policy risks selecting actions that violate the constraints in emergencies, which will escalate the problem of overloaded power lines and lead to large-scale blackouts. In this work, we propose a novel method for this problem that builds on a search-based planning algorithm. At the planning stage, the search space is limited to the action set produced by the policy. The selected action strictly follows the constraints, because its outcome is tested with the simulation function provided by the system. At the learning stage, to address the problem that gradients cannot be propagated to the policy, we introduce Evolutionary Strategies (ES) with black-box policy optimization to improve the policy directly, maximizing long-term returns. In the NeurIPS 2020 Learning to Run Power Network (L2RPN) competition, our solution safely managed the power grid and ranked first in both tracks.
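The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear policy, the function names (`action_set`, `plan`, `episode_return`, `es_step`, `toy_simulate`), the rule of taking the first candidate the simulator reports as safe, and the least-load fallback are all assumptions made for illustration. The planner searches only within the policy's top-k actions and checks each with a simulator. Because that selection step is not differentiable, the policy parameters are updated with a basic ES step that treats the episode return as a black box.

```python
import numpy as np

def action_set(theta, state, k):
    """Policy proposes the k highest-scoring actions (linear scores, an assumption)."""
    scores = theta @ state                      # theta: (n_actions, state_dim)
    return np.argsort(-scores)[:k]

def plan(theta, state, simulate, k=2, limit=1.0):
    """Search only inside the policy's action set and let the simulator veto unsafe actions.

    simulate(state, action) is a hypothetical interface that returns the
    predicted maximum line load ratio; a value below `limit` counts as safe.
    """
    candidates = action_set(theta, state, k)
    loads = [simulate(state, a) for a in candidates]
    for a, load in zip(candidates, loads):      # policy order, first safe one wins
        if load < limit:
            return int(a)
    # No candidate is safe: fall back to the one with the lowest predicted load.
    return int(candidates[int(np.argmin(loads))])

def episode_return(theta, states, simulate, k=2, limit=1.0):
    """Black-box fitness: number of steps survived when acting through the planner."""
    steps = 0
    for s in states:
        a = plan(theta, s, simulate, k, limit)
        if simulate(s, a) >= limit:
            break                               # overload ends the episode
        steps += 1
    return steps

def es_step(theta, fitness, rng, pop=50, sigma=0.1, lr=0.02):
    """One ES update: perturb the parameters, score each perturbation, move along the weighted noise."""
    eps = rng.standard_normal((pop,) + theta.shape)
    returns = np.array([fitness(theta + sigma * e) for e in eps])
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)   # standardized returns
    grad = np.tensordot(adv, eps, axes=1) / (pop * sigma)
    return theta + lr * grad

def toy_simulate(state, a):
    """Toy simulator (hypothetical): only the action matching the largest feature is safe."""
    return 0.5 if a == int(np.argmax(state)) else 1.2
```

In use, `fitness` would be something like `lambda th: episode_return(th, states, toy_simulate)`, so that each ES perturbation is scored by running the planner, and no gradient ever has to pass through the action-selection step.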
