We present a careful comparison of two model-free control algorithms, Evolution Strategies (ES) and Proximal Policy Optimization (PPO), with receding-horizon model predictive control (MPC) for operating simulated, price-responsive water heaters. Four MPC variants are considered: a one-shot controller with perfect forecasting, yielding optimal control; a limited-horizon controller with perfect forecasting; a mean-forecast controller; and a two-stage stochastic programming controller using historical scenarios. In all cases, the MPC models of water temperature and electricity price are exact; only water demand is uncertain. For comparison, both ES and PPO learn neural network-based policies by directly interacting with the simulated environment under the same scenarios used by MPC. All methods are then evaluated on a separate one-week continuation of the demand time series. We demonstrate that optimal control for this problem is challenging, requiring more than an 8-hour lookahead for MPC with perfect forecasting to attain the minimum cost. Despite this challenge, both ES and PPO learn good general-purpose policies that outperform the mean-forecast and two-stage stochastic MPC controllers in terms of average cost and are more than two orders of magnitude faster at computing actions. We show that ES in particular can leverage parallelism to learn a policy in under 90 seconds using 1150 CPU cores.
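To make the receding-horizon setup concrete, the following is a minimal sketch of one limited-horizon MPC step for a price-responsive water heater. It assumes a toy linear tank model and a demand forecast over the horizon; the dynamics coefficients, temperature band, and LP formulation are illustrative stand-ins, not the paper's exact model.

```python
import numpy as np
from scipy.optimize import linprog

def mpc_step(T0, prices, demand_forecast, horizon=8,
             a=0.98, b=0.5, c=2.0, T_min=45.0, T_max=60.0, u_max=4.0):
    """Solve one receding-horizon LP and return the first control action.

    Toy linearized dynamics (illustrative, not the paper's tank model):
        T[t+1] = a*T[t] + b*u[t] - c*d[t]
    Objective: minimize electricity cost sum_t prices[t] * u[t].
    """
    H = min(horizon, len(prices), len(demand_forecast))
    # Unroll dynamics: T[t] = a^t*T0 + sum_k a^(t-1-k)*(b*u[k] - c*d[k]).
    A = np.zeros((H, H))   # sensitivity of T[t] to heating input u[k]
    base = np.zeros(H)     # initial-condition and demand contribution
    for t in range(1, H + 1):
        for k in range(t):
            A[t - 1, k] = a ** (t - 1 - k) * b
            base[t - 1] -= a ** (t - 1 - k) * c * demand_forecast[k]
        base[t - 1] += a ** t * T0
    # Temperature band T_min <= T[t] <= T_max as two stacked inequalities.
    A_ub = np.vstack([A, -A])
    b_ub = np.concatenate([T_max - base, base - T_min])
    res = linprog(prices[:H], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0.0, u_max)] * H, method="highs")
    # Receding-horizon principle: apply only the first action, then re-solve
    # at the next step with updated state and forecasts.
    return res.x[0] if res.success else 0.0
```

The one-shot optimal controller corresponds to solving a single such program over the full evaluation window with the true demand realization; shortening `horizon` or replacing `demand_forecast` with a mean forecast recovers the weaker variants.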
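The speed claim for ES rests on its embarrassingly parallel fitness evaluations: each perturbed policy can be rolled out on a separate CPU core. Below is a minimal sketch of one OpenAI-style ES update under that assumption; `evaluate` is a hypothetical rollout of the water-heater policy, and all hyperparameters are illustrative, not the paper's settings.

```python
import numpy as np
from multiprocessing import Pool

def evaluate(params: np.ndarray) -> float:
    """Placeholder rollout: return the negative operating cost of one
    episode under the policy defined by `params` (hypothetical)."""
    raise NotImplementedError  # would step the simulated water heater

def es_update(params, sigma=0.1, alpha=0.01, pop_size=64, pool=None):
    """One ES gradient step; pop_size must be even for antithetic pairs."""
    # Sample antithetic Gaussian perturbations of the policy parameters.
    eps = np.random.randn(pop_size // 2, params.size)
    eps = np.concatenate([eps, -eps])
    candidates = params + sigma * eps
    # Fitness evaluations are independent, so they map cleanly onto CPU
    # cores; this is the parallelism behind sub-90-second training.
    mapper = pool.map if pool else map
    returns = np.array(list(mapper(evaluate, candidates)))
    # Rank-normalize returns for robustness, then take the ES step.
    ranks = returns.argsort().argsort() / (pop_size - 1) - 0.5
    grad = (ranks[:, None] * eps).sum(axis=0) / (pop_size * sigma)
    return params + alpha * grad
```

Antithetic sampling and rank normalization are standard variance-reduction choices in this family of ES methods; whether the paper adopts them is an assumption made here for the sketch.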