Recent works have shown that large models pretrained on common visual learning tasks can provide useful representations for a wide range of specialized perception problems, as well as for a variety of robotic manipulation tasks. While prior work on robotic manipulation has predominantly used frozen pretrained features, we demonstrate that in robotics this approach can fail to reach optimal performance, and that fine-tuning the full model can lead to significantly better results. Unfortunately, fine-tuning disrupts the pretrained visual representation and causes representational drift towards the fine-tuned task, leading to a loss of the versatility of the original model. We introduce "lossless adaptation" to address this shortcoming of classical fine-tuning. We demonstrate that appropriate placement of our parameter-efficient adapters can significantly reduce the performance gap between frozen pretrained representations and full end-to-end fine-tuning without changing the original representation, thus preserving the original capabilities of the pretrained model. We perform a comprehensive investigation across three major model architectures (ViTs, NFNets, and ResNets), using supervised (ImageNet-1K classification) and self-supervised (CLIP, BYOL, Visual MAE) pretrained weights, across three task domains and 35 individual tasks, and demonstrate that our claims are strongly validated in a variety of settings.
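To make the idea concrete, the sketch below illustrates one common way such parameter-efficient adaptation can be realized: a small residual bottleneck adapter attached to a frozen pretrained block. This is a minimal illustration, not the paper's implementation; the names `BottleneckAdapter` and `AdaptedBlock`, the bottleneck width, and the zero-initialization scheme are all assumptions for exposition. The key property it demonstrates is the "lossless" one: the backbone weights are never updated, so removing the adapter recovers the original pretrained model exactly.

```python
# Illustrative sketch (PyTorch). BottleneckAdapter and AdaptedBlock are
# hypothetical names, not the authors' actual architecture.
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Small residual bottleneck MLP; the only trainable parameters."""

    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so the adapter starts as an identity
        # residual: at initialization the adapted model reproduces the
        # frozen pretrained features exactly.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))


class AdaptedBlock(nn.Module):
    """Wraps a frozen pretrained block with a trainable adapter."""

    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # backbone stays frozen and unchanged
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x):
        return self.adapter(self.block(x))
```

Because only the adapter parameters receive gradients, the pretrained weights are untouched during task training; discarding the adapters restores the original model and its representations without any loss.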