Data efficiency is a key challenge for deep reinforcement learning. We address this problem by using unlabeled data to pretrain an encoder which is then finetuned on a small amount of task-specific data. To encourage learning representations which capture diverse aspects of the underlying MDP, we employ a combination of latent dynamics modelling and unsupervised goal-conditioned RL. When limited to 100k steps of interaction on Atari games (equivalent to two hours of human experience), our approach significantly surpasses prior work combining offline representation pretraining with task-specific finetuning, and compares favourably with other pretraining methods that require orders of magnitude more data. Our approach shows particular promise when combined with larger models as well as more diverse, task-aligned observational data -- approaching human-level performance and data efficiency on Atari in our best setting. We provide code associated with this work at https://github.com/mila-iqia/SGI.
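To make the two pretraining objectives concrete, below is a minimal PyTorch sketch of (i) a latent dynamics loss that predicts the next latent state from the current latent and action, and (ii) a dense reward for unsupervised goal-conditioned RL. All module names, network sizes, and the goal-reward definition here are illustrative assumptions, not SGI's exact implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps 84x84 grayscale frames to a latent vector (hypothetical sizes)."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.Linear(64 * 7 * 7, latent_dim)

    def forward(self, obs):
        return self.fc(self.conv(obs))

class LatentDynamics(nn.Module):
    """Predicts the next latent from the current latent and a one-hot action."""
    def __init__(self, latent_dim=256, num_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + num_actions, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, z, action_onehot):
        return self.net(torch.cat([z, action_onehot], dim=-1))

def dynamics_loss(encoder, dynamics, obs, action_onehot, next_obs):
    """Negative cosine similarity between predicted and target next latents,
    in the spirit of self-predictive latent dynamics modelling."""
    z = encoder(obs)
    z_pred = dynamics(z, action_onehot)
    with torch.no_grad():  # no gradients through the target branch
        z_target = encoder(next_obs)
    return -F.cosine_similarity(z_pred, z_target, dim=-1).mean()

def goal_reward(z_next, z_goal):
    """Dense reward for unsupervised goal-conditioned RL: larger when the
    achieved latent is closer to a sampled goal latent (assumed form)."""
    return F.cosine_similarity(z_next, z_goal, dim=-1)
```

In a pretraining loop under these assumptions, both signals would be computed on the same unlabeled transitions: `dynamics_loss` is minimized directly, while `goal_reward` labels transitions so a standard off-policy RL algorithm can train a goal-conditioned policy on top of the shared encoder.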