This study is motivated by critical challenges in biopharmaceutical manufacturing, including high complexity, high uncertainty, and very limited process data; each experiment run is often very expensive. To support optimal and robust process control, we propose a general green simulation assisted policy gradient (GS-PG) framework for both online and offline learning settings. To address key limitations of state-of-the-art reinforcement learning (RL), such as sample inefficiency and low reliability, we create a mixture likelihood ratio based policy gradient estimator that can leverage information from historical experiments conducted under different inputs, including process model coefficients and decision policy parameters. Then, to accelerate the learning of an optimal and robust policy, we further propose a variance reduction based sample selection method that allows GS-PG to intelligently select and reuse the most relevant historical trajectories. The selection rule automatically updates the set of reused samples as the process mechanisms are learned and the search for the optimal policy proceeds. Our theoretical and empirical studies demonstrate that the proposed framework outperforms the state-of-the-art policy gradient approach and accelerates optimal and robust process control for complex stochastic systems under high uncertainty.
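To make the reuse mechanism concrete, the sketch below illustrates, under our own assumptions, how a mixture likelihood ratio (MLR) weighted policy gradient estimate over historical trajectories could look. The function and argument names (mlr_policy_gradient, target_logp, target_score, behavior_logps) are hypothetical and not taken from the paper; the sketch only shows the reweighting idea: each reused trajectory's REINFORCE-style contribution is scaled by the likelihood under the current target policy divided by the mixture of the historical behavioral likelihoods.

```python
import numpy as np

def mlr_policy_gradient(historical_trajs, target_logp, target_score, behavior_logps):
    """Illustrative mixture-likelihood-ratio (MLR) policy-gradient estimate (hypothetical interface).

    historical_trajs : list of (trajectory, return) pairs reused from past experiments
    target_logp(traj)    : log-likelihood of traj under the current target policy/model
    target_score(traj)   : gradient of that log-likelihood w.r.t. the policy parameters
    behavior_logps(traj) : array of log-likelihoods of traj under each of the K historical
                           behavioral (model, policy) input settings
    """
    grads = []
    for traj, ret in historical_trajs:
        # Mixture likelihood ratio: p_theta(traj) / ((1/K) * sum_k p_k(traj)),
        # computed in log-space for numerical stability.
        logps = np.asarray(behavior_logps(traj))              # shape (K,)
        log_mix = np.logaddexp.reduce(logps) - np.log(len(logps))
        weight = np.exp(target_logp(traj) - log_mix)
        # REINFORCE-style contribution, reweighted so the reused trajectory
        # behaves as if it had been sampled under the current target inputs.
        grads.append(weight * ret * target_score(traj))
    return np.mean(grads, axis=0)
```

Weighting by a mixture of the historical behavioral likelihoods, rather than by any single one, keeps the ratio bounded when an individual historical setting assigns very low probability to a trajectory, which is the variance-reduction motivation behind reusing only the most relevant past experiments.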