Prompt tuning, a parameter- and data-efficient transfer learning paradigm that tunes only a small number of parameters in a model's input space, has become a trend in the vision community since the emergence of large vision-language models like CLIP. We present a systematic study of two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning. A major finding is that neither unimodal prompt tuning method performs consistently well: text prompt tuning fails on data with high intra-class visual variance, while visual prompt tuning cannot handle low inter-class variance. To combine the best of both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities. Extensive experiments on over 11 vision datasets show that UPT achieves a better trade-off than its unimodal counterparts on few-shot learning benchmarks as well as on domain generalization benchmarks. Code and models will be released to facilitate future research.
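To make "a tiny neural network that jointly optimizes prompts across different modalities" concrete, below is a minimal PyTorch sketch. It is illustrative only, not the paper's released implementation: the module name `UnifiedPromptGenerator`, the dimensions, and the single transformer layer are assumptions. The idea shown is a shared set of learnable prompt tokens, mixed by a small self-attention layer and then projected into text-side and vision-side prompts for a frozen CLIP model.

```python
import torch
import torch.nn as nn

class UnifiedPromptGenerator(nn.Module):
    """Illustrative UPT-style unified prompt module (a sketch, not the
    paper's code). A single set of learnable prompt tokens is passed
    through a tiny self-attention network, then split into text-side
    and vision-side prompts for a frozen CLIP text/image encoder pair.
    """

    def __init__(self, prompt_len: int = 4, dim: int = 512,
                 text_dim: int = 512, vision_dim: int = 768):
        super().__init__()
        # Shared learnable prompt tokens: together with the tiny network
        # below, these are the only trained parameters; CLIP stays frozen.
        self.prompts = nn.Parameter(torch.randn(2 * prompt_len, dim) * 0.02)
        # Tiny transformer layer letting the text- and vision-destined
        # prompt tokens interact before being routed to each modality.
        self.mixer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, dim_feedforward=dim * 4, batch_first=True)
        # Modality-specific projections to each encoder's embedding width.
        self.to_text = nn.Linear(dim, text_dim)
        self.to_vision = nn.Linear(dim, vision_dim)
        self.prompt_len = prompt_len

    def forward(self):
        # One forward pass produces prompts for both modalities from the
        # shared parameters, so they are jointly optimized.
        mixed = self.mixer(self.prompts.unsqueeze(0)).squeeze(0)
        text_prompts = self.to_text(mixed[: self.prompt_len])
        visual_prompts = self.to_vision(mixed[self.prompt_len:])
        return text_prompts, visual_prompts
```

In training, the text-side prompts would be prepended to the class-name token embeddings and the vision-side prompts to the image patch embeddings, with the CLIP contrastive loss back-propagated only into this small module, which is what makes the approach parameter-efficient.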