Reinforcement learning (RL) is a promising method for solving control problems. However, model-free RL algorithms are sample inefficient and require thousands, if not millions, of samples to learn optimal control policies. A major source of computational cost in RL is the transition function, which is dictated by the model dynamics. This is especially problematic when the model dynamics are represented by coupled PDEs; in such cases, the transition function often involves solving a large-scale discretization of these PDEs. We propose a multilevel RL framework that eases this cost by exploiting sublevel models corresponding to coarser-scale discretizations (i.e., multilevel models). This is done by formulating an approximate multilevel Monte Carlo estimate of the objective function of the policy and/or value network, instead of the Monte Carlo estimates used in the classical framework. As a demonstration of this framework, we present a multilevel version of the proximal policy optimization (PPO) algorithm, where the level refers to the grid fidelity of the chosen simulation-based environment. We provide two examples of simulation-based environments that employ stochastic PDEs solved using finite-volume discretization. For the case studies presented, we observed substantial computational savings using multilevel PPO compared to its classical counterpart.
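For context, a minimal sketch of the standard multilevel Monte Carlo telescoping identity on which such an estimator can be built (the symbols $J_\ell$, $N_\ell$, and $L$ are illustrative and not taken from the paper): if $J_\ell(\theta)$ denotes the objective evaluated on the level-$\ell$ discretization, with $\ell = L$ the finest grid, then
\[
\mathbb{E}\!\left[J_L(\theta)\right]
= \mathbb{E}\!\left[J_0(\theta)\right]
+ \sum_{\ell=1}^{L} \mathbb{E}\!\left[J_\ell(\theta) - J_{\ell-1}(\theta)\right],
\]
and each term is estimated with independent samples, e.g.
\[
\widehat{J}_{\mathrm{ML}}(\theta)
= \frac{1}{N_0}\sum_{i=1}^{N_0} J_0^{(i)}(\theta)
+ \sum_{\ell=1}^{L} \frac{1}{N_\ell}\sum_{i=1}^{N_\ell}
\left( J_\ell^{(i)}(\theta) - J_{\ell-1}^{(i)}(\theta) \right),
\]
where most samples ($N_0$) are drawn on the cheap coarse level and only a few ($N_\ell \ll N_0$) on the expensive correction terms, since the variance of $J_\ell - J_{\ell-1}$ typically shrinks as the grids refine. The paper's "approximate" multilevel estimator for the policy and value objectives may differ in detail from this generic form.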