The recent success of supervised learning methods on ever larger offline datasets has spurred interest in the reinforcement learning (RL) field to investigate whether the same paradigms can be translated to RL algorithms. This research area, known as offline RL, has largely focused on offline policy optimization, aiming to find a return-maximizing policy exclusively from offline data. In this paper, we consider a slightly different approach to incorporating offline data into sequential decision-making. We aim to answer the question, what unsupervised objectives applied to offline datasets are able to learn state representations which elevate performance on downstream tasks, whether those downstream tasks be online RL, imitation learning from expert demonstrations, or even offline policy optimization based on the same offline dataset? Through a variety of experiments utilizing standard offline RL datasets, we find that the use of pretraining with unsupervised learning objectives can dramatically improve the performance of policy learning algorithms that otherwise yield mediocre performance on their own. Extensive ablations further provide insights into what components of these unsupervised objectives -- e.g., reward prediction, continuous or discrete representations, pretraining or finetuning -- are most important and in which settings.
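To make the setup concrete, the following is a minimal sketch (not the paper's exact method) of the pretrain-then-policy-learn pipeline described above. It assumes a PyTorch-style data loader over offline (state, action, reward, next_state) tuples with hypothetical dimensions; the encoder is pretrained with a simple reward-prediction objective, one of the unsupervised objectives mentioned above, and then reused (frozen or finetuned) by a behavioral-cloning policy head.

```python
# Illustrative sketch only: hypothetical dimensions, architectures, and objective choices.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, REPR_DIM = 17, 6, 64  # hypothetical sizes

class Encoder(nn.Module):
    """Maps raw states to a learned state representation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 256), nn.ReLU(),
            nn.Linear(256, REPR_DIM),
        )

    def forward(self, s):
        return self.net(s)

def pretrain_encoder(encoder, loader, epochs=10):
    """Unsupervised pretraining: predict reward from (representation, action)."""
    reward_head = nn.Linear(REPR_DIM + ACTION_DIM, 1)
    opt = torch.optim.Adam(
        list(encoder.parameters()) + list(reward_head.parameters()), lr=3e-4
    )
    for _ in range(epochs):
        for s, a, r, _ in loader:
            pred = reward_head(torch.cat([encoder(s), a], dim=-1)).squeeze(-1)
            loss = nn.functional.mse_loss(pred, r)
            opt.zero_grad(); loss.backward(); opt.step()

def train_bc_policy(encoder, loader, epochs=10, finetune=False):
    """Downstream policy learning (behavioral cloning) on top of the representation."""
    policy_head = nn.Linear(REPR_DIM, ACTION_DIM)
    params = list(policy_head.parameters())
    if finetune:  # optionally keep adapting the encoder during policy learning
        params += list(encoder.parameters())
    opt = torch.optim.Adam(params, lr=3e-4)
    for _ in range(epochs):
        for s, a, _, _ in loader:
            z = encoder(s) if finetune else encoder(s).detach()
            loss = nn.functional.mse_loss(policy_head(z), a)
            opt.zero_grad(); loss.backward(); opt.step()
    return policy_head
```

The same pretrained encoder could likewise be handed to an online RL or offline policy optimization algorithm in place of the behavioral-cloning head; the sketch uses behavioral cloning only because it is the simplest downstream learner to write down.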