Fine-tuning reinforcement learning (RL) models has been challenging because of a lack of large-scale off-the-shelf datasets as well as high variance in transferability among different environments. Recent work has tackled offline RL from the perspective of sequence modeling, with improved results following the introduction of the Transformer architecture. However, when the model is trained from scratch, it suffers from slow convergence. In this paper, we take advantage of this formulation of reinforcement learning as sequence modeling and investigate the transferability of sequence models pre-trained on other domains (vision, language) when fine-tuned on offline RL tasks (control, games). To this end, we also propose techniques to improve transfer between these domains. Results show consistent performance gains in terms of both convergence speed and reward across a variety of environments: using Wikipedia-pretrained and GPT2 language models, training is accelerated by 3-6x and state-of-the-art performance is achieved on a range of tasks. We hope that this work not only highlights the potential of leveraging generic sequence modeling techniques and pre-trained models for RL, but also inspires future work on sharing knowledge between generative modeling tasks across completely different domains.
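To make the sequence-modeling formulation concrete, the following is a minimal sketch (not the authors' implementation) of how a pre-trained GPT2 backbone from the HuggingFace transformers library might be reused inside a Decision-Transformer-style model that consumes (return-to-go, state, action) trajectory tokens. The class name, projection layers, context length, and embedding choices are illustrative assumptions rather than the paper's exact architecture.

```python
# Sketch: fine-tuning a language-pretrained GPT-2 on offline RL trajectories,
# framed as sequence modeling. Assumes `torch` and `transformers` are installed.
import torch
import torch.nn as nn
from transformers import GPT2Model


class PretrainedDecisionTransformer(nn.Module):
    def __init__(self, state_dim, act_dim, hidden_size=768, max_ep_len=1000):
        super().__init__()
        # Load GPT-2 weights pre-trained on language (hidden size 768).
        self.transformer = GPT2Model.from_pretrained("gpt2")
        # Linear projections map returns-to-go, states, and actions into the
        # transformer's embedding space.
        self.embed_return = nn.Linear(1, hidden_size)
        self.embed_state = nn.Linear(state_dim, hidden_size)
        self.embed_action = nn.Linear(act_dim, hidden_size)
        self.embed_timestep = nn.Embedding(max_ep_len, hidden_size)
        self.predict_action = nn.Linear(hidden_size, act_dim)

    def forward(self, returns_to_go, states, actions, timesteps):
        # returns_to_go: (B, K, 1), states: (B, K, state_dim),
        # actions: (B, K, act_dim), timesteps: (B, K) integer indices.
        B, K = states.shape[0], states.shape[1]
        time_emb = self.embed_timestep(timesteps)
        r = self.embed_return(returns_to_go) + time_emb
        s = self.embed_state(states) + time_emb
        a = self.embed_action(actions) + time_emb
        # Interleave tokens as (R_1, s_1, a_1, R_2, s_2, a_2, ...).
        tokens = torch.stack((r, s, a), dim=2).reshape(B, 3 * K, -1)
        hidden = self.transformer(inputs_embeds=tokens).last_hidden_state
        # Predict each next action from the hidden state at the state token.
        state_hidden = hidden.reshape(B, K, 3, -1)[:, :, 1]
        return self.predict_action(state_hidden)
```

Fine-tuning would then minimize, e.g., a mean-squared error between predicted and logged actions on offline trajectories, with the pre-trained transformer weights updated alongside the new projection layers.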