Offline reinforcement learning (RL) tasks require the agent to learn from a pre-collected dataset without further interaction with the environment. Despite the potential to surpass the behavioral policies, RL-based methods are generally impractical due to training instability and the bootstrapping of extrapolation errors, which always require careful hyperparameter tuning via online evaluation. In contrast, offline imitation learning (IL) has no such issues, since it learns the policy directly without estimating the value function by bootstrapping. However, IL is usually limited by the capability of the behavioral policy and tends to learn mediocre behavior from datasets collected by a mixture of policies. In this paper, we aim to take advantage of IL while mitigating this drawback. Observing that behavior cloning is able to imitate neighboring policies with less data, we propose \textit{Curriculum Offline Imitation Learning (COIL)}, which utilizes an experience picking strategy to imitate adaptive neighboring policies with higher returns, improving the current policy along curriculum stages. On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids learning merely mediocre behavior on mixed datasets but is also competitive with state-of-the-art offline RL methods.
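To make the curriculum experience-picking idea concrete, the following is a minimal illustrative sketch of one possible stage-wise loop, assuming a log-likelihood proxy for policy "neighborhood" and a behavior-cloning update. The names `policy_log_prob`, `bc_update`, `estimate_return`, and the threshold `epsilon` are hypothetical placeholders, not the authors' reference implementation.

\begin{verbatim}
# Illustrative sketch only: pick neighboring, higher-return trajectories,
# behavior-clone them, then advance to the next curriculum stage.
import numpy as np

def pick_neighboring_trajectories(policy_log_prob, dataset, current_return,
                                  epsilon=0.1, top_k=32):
    """Select trajectories that (i) are likely under the current policy,
    i.e., come from a 'neighboring' behavior policy, and (ii) have a
    return higher than the current policy's estimated return."""
    candidates = []
    for traj in dataset:
        # Mean action log-likelihood under the current policy serves as a
        # proxy for how close the trajectory's behavior policy is.
        avg_logp = np.mean([policy_log_prob(s, a)
                            for s, a in zip(traj["states"], traj["actions"])])
        if avg_logp > np.log(epsilon) and traj["return"] > current_return:
            candidates.append((traj["return"], traj))
    # Keep the top-k highest-return neighbors for the next curriculum stage.
    candidates.sort(key=lambda x: x[0], reverse=True)
    return [traj for _, traj in candidates[:top_k]]

def curriculum_imitation(policy, dataset, bc_update, estimate_return,
                         n_stages=10):
    """Curriculum loop: repeatedly clone the best neighboring trajectories,
    discard the used experiences, and move on to the next stage."""
    for stage in range(n_stages):
        batch = pick_neighboring_trajectories(policy.log_prob, dataset,
                                              estimate_return(policy))
        if not batch:
            break  # no higher-return neighbors remain
        bc_update(policy, batch)      # behavior cloning on the picked subset
        for traj in batch:
            dataset.remove(traj)      # drop experiences already imitated
    return policy
\end{verbatim}

In this sketch, each stage restricts imitation to trajectories the current policy could plausibly have generated, which is the property that lets behavior cloning succeed with less data while still climbing toward higher-return behavior.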