Conventional reinforcement learning (RL) requires an environment for collecting fresh data, which is impractical when online interactions are costly. Offline RL provides an alternative by learning directly from a previously collected dataset. However, it yields unsatisfactory performance when the quality of the offline dataset is poor. In this paper, we consider an offline-to-online setting in which the agent is first trained on the offline dataset and then fine-tuned online, and we propose a framework called Adaptive Policy Learning to effectively take advantage of both offline and online data. Specifically, we explicitly account for the difference between online and offline data and apply an adaptive update scheme accordingly: a pessimistic update strategy for the offline dataset and an optimistic/greedy update scheme for the online dataset. This simple yet effective method provides a way to combine offline and online RL and achieve the best of both worlds. We further provide two concrete algorithms that implement the framework by embedding value-based or policy-based RL algorithms into it. Finally, we conduct extensive experiments on popular continuous control tasks; the results show that our algorithm can learn an expert policy with high sample efficiency even when the quality of the offline dataset is poor, e.g., a random dataset.
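To make the adaptive update idea concrete, the following is a minimal, self-contained sketch: transitions drawn from the offline dataset are trained with a pessimistic (CQL-style conservative) value update, while freshly collected online transitions use the standard greedy Q-learning update. The toy tabular setting, the routing function `adaptive_step`, and the pessimism weight `ALPHA` are illustrative assumptions for this sketch, not the paper's exact algorithm.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2
GAMMA, LR, ALPHA = 0.99, 0.1, 1.0   # ALPHA: pessimism weight (assumed value)

Q = np.zeros((N_STATES, N_ACTIONS))

def greedy_update(s, a, r, s_next):
    """Standard (optimistic/greedy) Q-learning update, used for online data."""
    target = r + GAMMA * Q[s_next].max()
    Q[s, a] += LR * (target - Q[s, a])

def pessimistic_update(s, a, r, s_next):
    """Conservative update used for offline data: the same TD target plus a
    CQL-style penalty that pushes down Q-values of actions outside the dataset
    and pushes up the Q-value of the action actually taken."""
    target = r + GAMMA * Q[s_next].max()
    td_error = target - Q[s, a]
    softmax = np.exp(Q[s] - Q[s].max())
    softmax /= softmax.sum()
    Q[s] -= LR * ALPHA * softmax          # push down all actions (weighted)
    Q[s, a] += LR * (td_error + ALPHA)    # TD step plus push up the data action

def adaptive_step(transition, from_offline):
    """Route each transition to the update rule matching its data source."""
    s, a, r, s_next = transition
    if from_offline:
        pessimistic_update(s, a, r, s_next)
    else:
        greedy_update(s, a, r, s_next)
```

In a full offline-to-online run, batches sampled from the offline buffer would be passed with `from_offline=True` and newly collected interactions with `from_offline=False`, so the degree of pessimism automatically tracks the source of the data rather than being fixed for the whole training process.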