Recent offline reinforcement learning (RL) studies have made considerable progress toward making RL usable in real-world systems by learning policies from pre-collected datasets without environment interaction. Unfortunately, existing offline RL methods still face many practical challenges in real-world system control tasks, such as computational constraints during agent training and the requirement for extra control flexibility. The model-based planning framework provides an attractive alternative. However, most model-based planning algorithms are not designed for offline settings. Simply combining the ingredients of offline RL with existing methods either yields over-restrictive planning or leads to inferior performance. We propose a new lightweight model-based offline planning framework, namely MOPP, which tackles the dilemma between the restrictions of offline learning and high-performance planning. MOPP encourages more aggressive trajectory rollouts guided by the behavior policy learned from the data, and prunes out problematic trajectories to avoid potential out-of-distribution samples. Experimental results show that MOPP provides competitive performance compared with existing model-based offline planning and RL approaches.
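To make the core idea concrete, the following is a minimal, hypothetical numpy sketch of behavior-guided rollout with out-of-distribution pruning, not the paper's actual implementation. The toy `behavior_policy`, `dynamics_ensemble`, reward function, and `DISAGREEMENT_THRESHOLD` are all assumptions introduced here for illustration; MOPP's concrete model classes and pruning criterion are specified in the paper itself.

```python
# Hypothetical sketch: sample rollouts guided by a behavior policy learned from
# offline data, prune trajectories whose dynamics-ensemble disagreement suggests
# out-of-distribution states, then execute the best surviving first action.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 3, 2
HORIZON, NUM_ROLLOUTS = 10, 64
DISAGREEMENT_THRESHOLD = 0.5  # assumed pruning threshold

def make_dynamics_member():
    # Toy linear dynamics model standing in for a learned one.
    W = rng.normal(scale=0.1, size=(STATE_DIM + ACTION_DIM, STATE_DIM))
    return lambda s, a: np.concatenate([s, a]) @ W

dynamics_ensemble = [make_dynamics_member() for _ in range(4)]

def behavior_policy(state):
    # Toy stand-in for a behavior policy fit to the offline dataset.
    return np.tanh(state[:ACTION_DIM]), 0.3 * np.ones(ACTION_DIM)

def reward_fn(state, action):
    return -np.sum(state ** 2)  # placeholder reward

def rollout_and_prune(init_state):
    kept = []
    for _ in range(NUM_ROLLOUTS):
        state, total_reward, first_action, ok = init_state.copy(), 0.0, None, True
        for _ in range(HORIZON):
            mean, std = behavior_policy(state)
            # "Aggressive" sampling: widen the behavior distribution slightly.
            action = mean + 1.5 * std * rng.standard_normal(ACTION_DIM)
            if first_action is None:
                first_action = action
            preds = np.stack([f(state, action) for f in dynamics_ensemble])
            if preds.std(axis=0).mean() > DISAGREEMENT_THRESHOLD:
                ok = False  # likely out-of-distribution; prune this trajectory
                break
            total_reward += reward_fn(state, action)
            state = preds.mean(axis=0)
        if ok:
            kept.append((total_reward, first_action))
    if not kept:
        return behavior_policy(init_state)[0]  # fall back to the behavior mean
    return max(kept, key=lambda x: x[0])[1]

print(rollout_and_prune(np.zeros(STATE_DIM)))
```

The sketch illustrates the trade-off named in the abstract: sampling beyond the behavior policy's mean enables higher-performing plans, while the ensemble-disagreement check discards rollouts that stray from the data distribution.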