Offline reinforcement learning (RL) enables agents to learn effectively from logged data, which significantly extends the applicability of RL algorithms to real-world scenarios where exploration can be expensive or unsafe. Previous works have shown that extracting primitive skills from the recurring and temporally extended structures in the logged data improves learning. However, these methods degrade substantially when the primitives lack the representational capacity to recover the original policy space, a problem that is especially severe in offline settings. In this paper, we give a quantitative characterization of the performance of offline hierarchical learning and highlight the importance of learning lossless primitives. To this end, we propose to use a \emph{flow}-based structure as the representation for low-level policies. This allows us to represent the behaviors in the dataset faithfully while retaining the expressiveness needed to recover the whole policy space. We show that such lossless primitives can drastically improve the performance of hierarchical policies. Experimental results and extensive ablation studies on the standard D4RL benchmark show that our method represents policies well and achieves superior performance on most tasks.
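To make the abstract's notion of a flow-based low-level policy concrete, the following is a minimal, self-contained sketch in PyTorch, not the authors' implementation: a state-conditioned normalizing flow that maps a Gaussian latent invertibly to an action. The affine-coupling design, layer count, hidden sizes, and names such as `FlowPolicy` are illustrative assumptions; the point is that exact log-likelihoods let the primitive fit dataset behaviors losslessly, while invertibility preserves the ability to express any action.

```python
# Sketch of a state-conditioned normalizing-flow policy (illustrative, assumed design).
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """One affine coupling layer whose scale/shift are conditioned on the state."""

    def __init__(self, action_dim, state_dim, hidden=64, flip=False):
        super().__init__()
        self.flip = flip                      # alternate which half gets transformed
        self.half = action_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, (action_dim - self.half) * 2),  # log-scale and shift
        )

    def forward(self, a, s):
        if self.flip:
            a = a.flip(-1)
        a1, a2 = a[..., :self.half], a[..., self.half:]
        log_scale, shift = self.net(torch.cat([a1, s], -1)).chunk(2, -1)
        log_scale = torch.tanh(log_scale)     # keep the map well-conditioned
        z = torch.cat([a1, a2 * torch.exp(log_scale) + shift], -1)
        return (z.flip(-1) if self.flip else z), log_scale.sum(-1)  # log|det J|

    def inverse(self, z, s):
        if self.flip:
            z = z.flip(-1)
        z1, z2 = z[..., :self.half], z[..., self.half:]
        log_scale, shift = self.net(torch.cat([z1, s], -1)).chunk(2, -1)
        log_scale = torch.tanh(log_scale)
        a = torch.cat([z1, (z2 - shift) * torch.exp(-log_scale)], -1)
        return a.flip(-1) if self.flip else a


class FlowPolicy(nn.Module):
    """pi(a|s) as an invertible flow from a standard Gaussian base distribution."""

    def __init__(self, action_dim, state_dim, n_layers=4):
        super().__init__()
        self.action_dim = action_dim
        self.layers = nn.ModuleList(
            [AffineCoupling(action_dim, state_dim, flip=(i % 2 == 1))
             for i in range(n_layers)]
        )
        self.base = torch.distributions.Normal(0.0, 1.0)

    def log_prob(self, a, s):
        # Exact likelihood of a dataset action via the change-of-variables formula;
        # maximizing it over the logged data is the "lossless" imitation objective.
        log_det = torch.zeros(a.shape[:-1])
        for layer in self.layers:
            a, ld = layer(a, s)
            log_det = log_det + ld
        return self.base.log_prob(a).sum(-1) + log_det

    def sample(self, s):
        # Decode a latent (e.g. chosen by a high-level policy) into an action.
        z = self.base.sample(s.shape[:-1] + (self.action_dim,))
        for layer in reversed(self.layers):
            z = layer.inverse(z, s)
        return z


if __name__ == "__main__":
    policy = FlowPolicy(action_dim=6, state_dim=17)   # e.g. D4RL halfcheetah sizes
    s, a = torch.randn(32, 17), torch.tanh(torch.randn(32, 6))
    loss = -policy.log_prob(a, s).mean()              # maximum-likelihood fit
    loss.backward()
    print(loss.item(), policy.sample(s).shape)
```

Because the map from latent to action is bijective for every state, such a primitive can in principle reproduce any action the dataset contains and still emit any action the high-level policy requests, which is the sense in which it is "lossless".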