In computer vision, fine-tuning is the de facto approach to leveraging pre-trained vision models for downstream tasks. However, deploying it in practice is challenging because it updates all model parameters, which is parameter-inefficient, and relies heavily on high-quality downstream data. Recently, prompt-based learning, which adds a task-relevant prompt to adapt downstream tasks to pre-trained models, has drastically boosted performance on many natural language downstream tasks. In this work, we extend the transfer ability that prompts provide to vision models, as an alternative to fine-tuning. To this end, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks. The key to Pro-tuning is prompt-based tuning, i.e., learning task-specific vision prompts for downstream input images while keeping the pre-trained model frozen. By training only a few additional parameters, it can work with diverse CNN-based and Transformer-based architectures. Extensive experiments show that Pro-tuning outperforms fine-tuning across a broad range of vision tasks and scenarios, including image classification (generic objects, class imbalance, image corruption, adversarial robustness, and out-of-distribution generalization) and dense prediction tasks such as object detection and semantic segmentation.
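The core idea of prompt-based tuning, learning a small set of prompt parameters while the pre-trained model stays frozen, can be illustrated with a toy sketch. This is not the Pro-tuning architecture: the linear `frozen_model`, the additive input prompt, the squared-error loss, and all names and numbers below are assumptions chosen only to keep the example self-contained.

```python
# Toy illustration of prompt-based tuning (NOT the Pro-tuning architecture).
# Assumptions: a fixed linear scorer stands in for the pre-trained model, and
# the "prompt" is a learnable vector added to every input.

WEIGHTS = (0.5, -1.0, 2.0)  # frozen "pre-trained" weights (hypothetical values)
BIAS = 0.1                  # frozen bias (hypothetical value)


def frozen_model(x, weights=WEIGHTS, bias=BIAS):
    """Stand-in for a pre-trained model; its parameters are never updated."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def mse(data, prompt):
    """Mean squared error of the frozen model on prompted inputs."""
    total = 0.0
    for x, y in data:
        prompted = [xi + pi for xi, pi in zip(x, prompt)]
        total += (frozen_model(prompted) - y) ** 2
    return total / len(data)


def tune_prompt(data, steps=200, lr=0.05):
    """Gradient descent on the prompt only; WEIGHTS and BIAS stay untouched."""
    prompt = [0.0] * len(WEIGHTS)
    for _ in range(steps):
        grad = [0.0] * len(prompt)
        for x, y in data:
            prompted = [xi + pi for xi, pi in zip(x, prompt)]
            err = frozen_model(prompted) - y
            # d/dp_i of err^2 is 2 * err * w_i for a linear model.
            for i, w in enumerate(WEIGHTS):
                grad[i] += 2.0 * err * w / len(data)
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt


# Hypothetical downstream task: targets are the frozen model's outputs shifted
# by a constant, so the frozen model alone is systematically off.
INPUTS = [(1.0, 0.0, 0.5), (0.2, -0.3, 1.0), (-1.0, 2.0, 0.0), (0.0, 0.0, 0.0)]
DATA = [(x, frozen_model(x) + 1.5) for x in INPUTS]

loss_before = mse(DATA, [0.0] * len(WEIGHTS))
prompt = tune_prompt(DATA)
loss_after = mse(DATA, prompt)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.6f}")
```

Only `len(WEIGHTS)` prompt values are trained here, whereas fine-tuning would update `WEIGHTS` and `BIAS` themselves. In a real vision model the frozen part would be a CNN or Transformer backbone, and the form of the prompt would follow the paper rather than this additive toy.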