Inspired by the recent success of sequence modeling in RL and the use of masked language models for pre-training, we propose a masked model for pre-training in RL, RePreM (Representation Pre-training with Masked Model), which trains an encoder combined with transformer blocks to predict masked states or actions in a trajectory. RePreM is simple yet effective compared with existing representation pre-training methods in RL. Through sequence modeling, it avoids algorithmic sophistication (such as data augmentation or estimating multiple models) and produces a representation that captures long-term dynamics well. Empirically, we demonstrate the effectiveness of RePreM on various tasks, including dynamics prediction, transfer learning, and sample-efficient RL with both value-based and actor-critic methods. Moreover, we show that RePreM scales well with dataset size, dataset quality, and encoder scale, which indicates its potential toward large RL models.
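The core idea (an encoder plus transformer blocks trained to reconstruct masked states or actions from the rest of the trajectory) can be illustrated with a minimal sketch. This is an illustrative assumption of the setup, not the authors' implementation: module names, the interleaved token layout, the 15% mask ratio, and all hyperparameters are hypothetical choices in the style of BERT-like masked pre-training.

```python
# Hypothetical sketch of masked trajectory pre-training, NOT the RePreM code:
# encode states/actions into tokens, hide a random subset behind a learned
# mask token, and train a transformer to reconstruct the masked entries.
import torch
import torch.nn as nn

torch.manual_seed(0)

class MaskedTrajectoryModel(nn.Module):
    def __init__(self, state_dim, action_dim, d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        self.state_enc = nn.Linear(state_dim, d_model)    # state encoder (assumed)
        self.action_enc = nn.Linear(action_dim, d_model)  # action encoder (assumed)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.state_head = nn.Linear(d_model, state_dim)   # reconstructs masked states

    def forward(self, states, actions, mask):
        # states: (B, T, state_dim); actions: (B, T, action_dim)
        # mask: (B, 2T) boolean, True where a token is hidden from the model
        s, a = self.state_enc(states), self.action_enc(actions)
        # Interleave tokens as s_0, a_0, s_1, a_1, ...
        tokens = torch.stack([s, a], dim=2).flatten(1, 2)          # (B, 2T, d_model)
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        h = self.transformer(tokens)
        return self.state_head(h[:, 0::2])                         # state positions

model = MaskedTrajectoryModel(state_dim=4, action_dim=2)
states, actions = torch.randn(8, 10, 4), torch.randn(8, 10, 2)
mask = torch.rand(8, 20) < 0.15            # mask ~15% of trajectory tokens
pred = model(states, actions, mask)        # (8, 10, 4)
loss = ((pred - states) ** 2)[mask[:, 0::2]].mean()  # loss on masked states only
```

After pre-training on offline trajectories, the encoder (here `state_enc`) would be detached and reused as the representation for a downstream value-based or actor-critic agent.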