In the research area of offline model-based optimization, novel and promising methods are frequently developed. However, implementing such methods in real-world industrial systems, such as production lines for process control, is often a frustrating process. In this work, we address two important problems in order to extend the current success of offline model-based optimization to industrial process control: 1) how to learn a reliable dynamics model from offline data for industrial processes, and 2) how to learn a reliable yet not over-conservative control policy from offline data by utilizing existing model-based optimization algorithms. Specifically, we propose a dynamics model based on an ensemble of conditional generative adversarial networks to achieve accurate reward calculation in industrial scenarios. Furthermore, we propose an epistemic-uncertainty-penalized reward evaluation function that effectively avoids assigning over-estimated rewards to out-of-distribution inputs during the learning/search of the optimal control policy. We conduct extensive experiments with the proposed method on two representative cases (a discrete control case and a continuous control case), showing that it compares favorably to several baselines in offline policy learning for industrial process control.
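To make the epistemic-uncertainty-penalized reward evaluation concrete, the following is a minimal sketch, assuming the dynamics model is realized as an ensemble of conditional generators and that a known reward function is applied to each predicted next state. All names (`generators`, `reward_fn`, `penalty_coef`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def penalized_reward(state, action, generators, reward_fn, penalty_coef=1.0):
    """Evaluate a candidate (state, action) pair under every ensemble member and
    penalize disagreement among members, used here as a proxy for epistemic
    uncertainty, so that out-of-distribution inputs do not receive
    over-estimated rewards during policy learning/search.

    NOTE: this is a hedged sketch; the actual penalty form in the paper may differ.
    """
    # Each conditional generator predicts a next state given (state, action).
    next_states = [g(state, action) for g in generators]

    # Reward implied by each ensemble member's prediction.
    rewards = np.array([reward_fn(state, action, s_next) for s_next in next_states])

    # Mean reward minus a penalty proportional to the ensemble spread:
    # high disagreement (likely out-of-distribution input) lowers the score.
    return rewards.mean() - penalty_coef * rewards.std()
```

Penalizing ensemble disagreement in this way discourages the policy search from exploiting regions where the learned dynamics model is unreliable, which is one common way to keep an offline-learned policy reliable without making it over-conservative.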