We are interested in learning scalable agents for reinforcement learning that can learn from large-scale, diverse sequential data, similar to current large vision and language models. To this end, this paper presents masked decision prediction (MaskDP), a simple and scalable self-supervised pretraining method for reinforcement learning (RL) and behavioral cloning (BC). In our MaskDP approach, we apply a masked autoencoder (MAE) to state-action trajectories, randomly masking state and action tokens and reconstructing the missing data. To do so, the model must infer the masked-out states and actions and thereby extract information about the underlying dynamics. We find that masking different proportions of the input sequence significantly helps the model generalize to multiple downstream tasks. In our empirical study, we find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching, and that it can zero-shot infer skills from a few example transitions. In addition, MaskDP transfers well to offline RL and shows promising scaling behavior with respect to model size. It is amenable to data-efficient finetuning, achieving competitive results with prior methods based on autoregressive pretraining.
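To make the pretraining objective concrete, below is a minimal sketch of masked trajectory autoencoding in PyTorch. It is not the authors' implementation: the class name, architecture sizes, and the fixed `mask_ratio` argument are illustrative assumptions (the paper varies the masking proportion rather than fixing it).

```python
# Minimal sketch of MaskDP-style pretraining (illustrative, not the authors' code):
# interleave state/action tokens, randomly mask a fraction of them, and train a
# bidirectional transformer to reconstruct the masked tokens.
import torch
import torch.nn as nn

class MaskedTrajectoryAutoencoder(nn.Module):  # hypothetical name
    def __init__(self, state_dim, action_dim, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.state_in = nn.Linear(state_dim, d_model)
        self.action_in = nn.Linear(action_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))   # learned [MASK] embedding
        self.pos_emb = nn.Parameter(torch.zeros(1, 1024, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.state_out = nn.Linear(d_model, state_dim)
        self.action_out = nn.Linear(d_model, action_dim)

    def forward(self, states, actions, mask_ratio=0.75):
        # states: (B, T, state_dim), actions: (B, T, action_dim)
        B, T, _ = states.shape
        tokens = torch.stack([self.state_in(states), self.action_in(actions)], dim=2)
        tokens = tokens.flatten(1, 2)            # interleave: s_0, a_0, s_1, a_1, ...
        L = tokens.shape[1]
        # randomly replace a fraction of tokens with the mask embedding
        mask = torch.rand(B, L, device=tokens.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        h = self.encoder(tokens + self.pos_emb[:, :L])
        h = h.view(B, T, 2, -1)                  # un-interleave into (state, action) slots
        pred_s = self.state_out(h[:, :, 0])
        pred_a = self.action_out(h[:, :, 1])
        # reconstruction (MSE) loss computed on masked positions only
        m = mask.view(B, T, 2).float()
        loss_s = (((pred_s - states) ** 2).mean(-1) * m[:, :, 0]).sum() / m[:, :, 0].sum().clamp(min=1)
        loss_a = (((pred_a - actions) ** 2).mean(-1) * m[:, :, 1]).sum() / m[:, :, 1].sum().clamp(min=1)
        return loss_s + loss_a
```

Because the encoder is bidirectional, the same pretrained model can in principle be queried at downstream time by masking only the tokens to be predicted, for example masking future actions toward a given goal state, which is what enables the zero-shot goal-reaching behavior described above.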