The current modus operandi in adapting pre-trained models involves updating all the backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small amount (less than 1% of model parameters) of trainable parameters in the input space while keeping the model backbone frozen. Via extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains compared to other parameter-efficient tuning protocols. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training data scales, while reducing per-task storage cost.
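To make the mechanism concrete, below is a minimal PyTorch sketch of the shallow variant of this idea: learnable prompt tokens prepended to the patch-token sequence of a frozen Transformer, with only the prompts and a linear head trained. The `TinyViTBackbone` stub, its `embed`/`encode` hooks, the omission of a class token, and all dimensions are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TinyViTBackbone(nn.Module):
    """Stand-in ViT-style backbone (assumed for this sketch):
    patch embedding followed by Transformer encoder blocks."""
    def __init__(self, embed_dim=192, depth=4, num_heads=4, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=embed_dim * 4,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def embed(self, x):                       # (B, 3, H, W) -> (B, N, D)
        return self.patch_embed(x).flatten(2).transpose(1, 2)

    def encode(self, tokens):                 # run the Transformer blocks
        return self.blocks(tokens)

class PromptedViT(nn.Module):
    """VPT-style wrapper: learnable prompts in the input (token) space,
    frozen backbone, small trainable classification head.
    The class token is omitted here for brevity (mean pooling instead)."""
    def __init__(self, backbone, embed_dim=192, num_prompts=10, num_classes=100):
        super().__init__()
        self.backbone = backbone
        # Freeze every backbone parameter; only prompts + head are trained.
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.empty(1, num_prompts, embed_dim))
        nn.init.uniform_(self.prompts, -0.1, 0.1)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.backbone.embed(x)                        # (B, N, D)
        prompts = self.prompts.expand(tokens.size(0), -1, -1)  # (B, P, D)
        tokens = torch.cat([prompts, tokens], dim=1)           # (B, P+N, D)
        feats = self.backbone.encode(tokens)
        return self.head(feats.mean(dim=1))                    # pooled logits

# Sanity check: the trainable fraction stays on the order of 1%.
model = PromptedViT(TinyViTBackbone())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```

Because the frozen backbone is shared across tasks, only the prompts and head (a few tens of thousands of parameters in this toy configuration) need to be stored per task, which is the source of the per-task storage savings claimed above.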