Recent years have seen the emergence of pre-trained representations as a powerful abstraction for AI applications in computer vision, natural language, and speech. However, policy learning for control is still dominated by a tabula-rasa learning paradigm, with visuo-motor policies often trained from scratch using data from deployment environments. In this context, we revisit and study the role of pre-trained visual representations for control, and in particular representations trained on large-scale computer vision datasets. Through extensive empirical evaluation in diverse control domains (Habitat, DeepMind Control, Adroit, Franka Kitchen), we isolate and study the importance of different representation training methods, data augmentations, and feature hierarchies. Overall, we find that pre-trained visual representations can be competitive or even better than ground-truth state representations to train control policies. This is in spite of using only out-of-domain data from standard vision datasets, without any in-domain data from the deployment environments. Source code and more at https://sites.google.com/view/pvr-control.