Vision-language models have recently shown great potential on many computer vision tasks. Meanwhile, prior work demonstrates that prompt tuning designed for vision-language models can achieve superior performance on few-shot image recognition compared to linear probing, a strong baseline. In real-world applications, many few-shot tasks are correlated, particularly within a specialized domain. However, such information is ignored by previous work. Inspired by the fact that modeling task relationships via multi-task learning usually boosts performance, we propose SoftCPT (Soft Context Sharing for Prompt Tuning), a novel method to fine-tune pre-trained vision-language models on multiple target few-shot tasks simultaneously. Specifically, we design a task-shared meta network that generates a prompt vector for each task, taking the pre-defined task name together with a learnable meta prompt as input. In this way, the prompt vectors of all tasks are shared in a soft manner. The parameters of this shared meta network, as well as the meta prompt vector, are tuned on the joint training set of all target tasks. Extensive experiments on three multi-task few-shot datasets show that SoftCPT outperforms the representative single-task prompt tuning method CoOp [78] by a large margin, implying the effectiveness of multi-task learning in vision-language prompt tuning. The source code and data will be made publicly available.
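To make the described architecture concrete, below is a minimal PyTorch sketch of a task-shared meta network that maps a task-name embedding plus a learnable meta prompt to per-task context vectors, which would be prepended to class-name tokens in CLIP's text encoder. All dimensions, layer choices, and the fusion scheme are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MetaPromptNet(nn.Module):
    """Hypothetical sketch of a task-shared meta network: one network
    serves every task, so prompt vectors are shared in a soft manner.
    Only its parameters and the meta prompt are tuned on the joint
    training set of all target tasks."""

    def __init__(self, embed_dim=512, n_ctx=16, n_meta=4):
        super().__init__()
        # Learnable meta prompt, shared across all tasks (assumed shape).
        self.meta_prompt = nn.Parameter(torch.randn(n_meta, embed_dim) * 0.02)
        # Shared MLP that generates the task-specific context vectors.
        self.net = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, n_ctx * embed_dim),
        )
        self.n_ctx, self.embed_dim = n_ctx, embed_dim

    def forward(self, task_name_emb):
        # task_name_emb: (embed_dim,) pooled embedding of the task name,
        # e.g. from a frozen CLIP text encoder (assumed available upstream).
        x = torch.cat([self.meta_prompt, task_name_emb.unsqueeze(0)], dim=0)
        pooled = x.mean(dim=0)                       # fuse meta prompt + task name
        ctx = self.net(pooled)                       # (n_ctx * embed_dim,)
        return ctx.view(self.n_ctx, self.embed_dim)  # per-task prompt vectors

# Usage: the same network instance is applied to every task's name embedding.
net = MetaPromptNet()
task_emb = torch.randn(512)        # stand-in for an encoded task name
prompt_vectors = net(task_emb)     # (16, 512) context for this task
```

Because the generator, rather than the prompt vectors themselves, is shared, related tasks can influence one another's prompts without being forced to use identical context.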