Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data -- here, online videos of people playing Minecraft -- from which we can then train a general behavioral prior. Despite using the native human interface (mouse and keyboard at 20Hz), we show that this behavioral prior has nontrivial zero-shot capabilities and that it can be fine-tuned, with both imitation learning and reinforcement learning, to hard-exploration tasks that are impossible to learn from scratch via reinforcement learning. For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish.
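The three-stage recipe described above (fit an inverse dynamics model on a small labeled set, use it to pseudo-label a large unlabeled corpus, then train a behavioral prior by imitation) can be illustrated with a toy sketch. This is not the authors' implementation: the real system works on video frames with neural networks, whereas this sketch assumes a 1-D grid world in which lookup tables stand in for both models. All names here (`rollout`, `fit_idm`, `pseudo_label`, `fit_policy`, `GOAL`) are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict

ACTIONS = (-1, 0, 1)  # toy action set: move left, stay, move right
GOAL = 5              # hypothetical target position the demonstrator walks to


def expert_action(pos):
    """Toy demonstrator: step toward GOAL, then stay there."""
    if pos < GOAL:
        return 1
    if pos > GOAL:
        return -1
    return 0


def rollout(start, steps, rng, noise):
    """Return (obs, action, next_obs) triples from a possibly noisy demonstrator."""
    pos, traj = start, []
    for _ in range(steps):
        act = rng.choice(ACTIONS) if rng.random() < noise else expert_action(pos)
        nxt = pos + act
        traj.append((pos, act, nxt))
        pos = nxt
    return traj


def fit_idm(labeled):
    """Inverse dynamics model: map an observed change to its most frequent action."""
    votes = defaultdict(Counter)
    for obs, act, nxt in labeled:
        votes[nxt - obs][act] += 1
    return {delta: c.most_common(1)[0][0] for delta, c in votes.items()}


def pseudo_label(idm, video):
    """Attach IDM-predicted actions to an action-free sequence of observations."""
    return [(o, idm[n - o]) for o, n in zip(video, video[1:])]


def fit_policy(pairs):
    """Behavioral cloning: most frequent pseudo-labeled action per observation."""
    votes = defaultdict(Counter)
    for obs, act in pairs:
        votes[obs][act] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in votes.items()}


rng = random.Random(0)

# Stage 1: a small labeled set (two noise-free demonstrations) trains the IDM.
labeled = rollout(0, 30, rng, noise=0.0) + rollout(10, 30, rng, noise=0.0)
idm = fit_idm(labeled)

# Stage 2: a much larger unlabeled set, kept as observations only ("videos").
videos = []
for _ in range(200):
    traj = rollout(rng.randint(0, 10), 30, rng, noise=0.1)
    videos.append([traj[0][0]] + [step[2] for step in traj])

# Stage 3: pseudo-label every video with the IDM, then clone the behavior.
pairs = [pair for video in videos for pair in pseudo_label(idm, video)]
policy = fit_policy(pairs)

print({pos: policy.get(pos) for pos in range(11)})
```

In this toy setting the action can be read directly off the change in position, so the inverse dynamics model is trivially exact. The point of the sketch is only the data flow: 60 labeled transitions are enough to label 6,000 unlabeled ones, and the policy is trained solely on those pseudo-labels.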