State-of-the-art reinforcement learning algorithms mostly rely on being allowed to directly interact with their environment to collect millions of observations. This makes it hard to transfer their success to industrial control problems, where simulations are often very costly or do not exist, and exploring in the real environment can potentially lead to catastrophic events. Recently developed model-free offline RL algorithms can learn from a single dataset (containing limited exploration) by mitigating extrapolation error in value functions. However, the robustness of the training process is still comparatively low, an issue known from methods that rely on value functions. To improve robustness and stability of the learning process, we use dynamics models instead of value functions to assess policy performance, resulting in MOOSE (MOdel-based Offline policy Search with Ensembles), an algorithm which ensures low model bias by keeping the policy within the support of the data. We compare MOOSE with the state-of-the-art model-free offline RL algorithms BRAC, BEAR, and BCQ on the Industrial Benchmark and MuJoCo continuous control tasks in terms of robust performance, and find that MOOSE outperforms its model-free counterparts in almost all considered cases, often by a wide margin.
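To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of evaluating a policy with an ensemble of learned dynamics models plus a penalty for leaving the support of the offline data. Linear bootstrapped models stand in for the neural ensemble, and a nearest-neighbor action distance stands in for whatever support measure the paper uses; all function names and parameters (`fit_ensemble`, `support_penalty`, `beta`, etc.) are hypothetical.

```python
# Simplified sketch: model-based offline policy evaluation with an ensemble
# and a data-support penalty. Not the MOOSE implementation.
import numpy as np

rng = np.random.default_rng(0)

def fit_ensemble(states, actions, next_states, rewards, n_models=5):
    """Fit a small ensemble of linear dynamics/reward models on bootstrapped
    subsets of the offline dataset (a stand-in for neural network models)."""
    X = np.hstack([states, actions])
    Y = np.hstack([next_states, rewards[:, None]])
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), len(X))            # bootstrap resample
        W, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
        models.append(W)
    return models

def support_penalty(action, data_actions):
    """Distance to the nearest action in the offline data -- a crude proxy
    for penalizing policies that stray outside the data support."""
    return np.min(np.linalg.norm(data_actions - action, axis=1))

def evaluate_policy(policy, models, start_states, data_actions,
                    horizon=20, gamma=0.99, beta=1.0):
    """Score a policy by rolling it out in every ensemble member and
    averaging the penalized discounted return -- no value function needed."""
    returns = []
    for W in models:
        for s in start_states:
            total, s_t = 0.0, s.copy()
            for t in range(horizon):
                a_t = policy(s_t)
                pred = np.hstack([s_t, a_t]) @ W         # predicted (s', r)
                s_t, r_t = pred[:-1], pred[-1]
                total += gamma**t * (r_t - beta * support_penalty(a_t, data_actions))
            returns.append(total)
    return float(np.mean(returns))
```

Such a score can then drive any policy-improvement loop (e.g., gradient ascent or random search over policy parameters), which is the sense in which the dynamics ensemble replaces a learned value function for judging candidate policies.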