The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its application to diverse downstream vision tasks. To improve its capacity on downstream tasks, few-shot learning has become a widely adopted technique. However, existing methods either exhibit limited performance or suffer from excessive learnable parameters. In this paper, we propose APE, an Adaptive Prior rEfinement method for CLIP's pre-trained knowledge, which achieves superior accuracy with high computational efficiency. Via a prior refinement module, we analyze the inter-class disparity in the downstream data and decouple the domain-specific knowledge from the CLIP-extracted cache model. On top of that, we introduce two model variants: a training-free APE and a training-required APE-T. We explore the trilateral affinities between the test image, prior cache model, and textual representations, and train only a lightweight category-residual module. In terms of average accuracy over 11 benchmarks, both APE and APE-T attain state-of-the-art performance and respectively outperform the second-best method by +1.59% and +1.99% under 16 shots with ×30 fewer learnable parameters.
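The abstract does not spell out the exact formulation, so the snippet below is only a minimal, training-free sketch of the general idea: a channel-selection step stands in for the prior refinement module, and the three pairwise similarities (test image–text, test image–cache, cache–text) stand in for the trilateral affinities. The feature shapes, the variance-based channel criterion, and the blending weights `alpha`/`beta` are assumptions for illustration, and random tensors replace real CLIP features.

```python
import torch
import torch.nn.functional as F

# Stand-ins for CLIP-extracted features (hypothetical shapes; in practice these
# would come from a frozen CLIP image/text encoder).
C, K, D, N = 10, 16, 512, 8          # classes, shots, feature dim, test images
torch.manual_seed(0)

text_feats = F.normalize(torch.randn(C, D), dim=-1)       # class text embeddings
cache_keys = F.normalize(torch.randn(C * K, D), dim=-1)   # few-shot image features
cache_vals = torch.eye(C).repeat_interleave(K, dim=0)     # one-hot labels of the cache
test_feats = F.normalize(torch.randn(N, D), dim=-1)       # test image features

# --- Prior refinement (assumed criterion: keep the channels whose text
# embeddings vary most across classes, i.e. the most discriminative ones) ---
Q = 128                                                    # number of refined channels (assumed)
channel_score = text_feats.var(dim=0)                      # per-channel spread over classes
topk = channel_score.topk(Q).indices

t_r = F.normalize(text_feats[:, topk], dim=-1)
k_r = F.normalize(cache_keys[:, topk], dim=-1)
f_r = F.normalize(test_feats[:, topk], dim=-1)

# --- Trilateral affinities, combined training-free (weights are assumed) ---
alpha, beta = 1.0, 5.0
zero_shot = 100.0 * test_feats @ text_feats.t()            # image <-> text
img_cache = torch.exp(-beta * (1 - f_r @ k_r.t()))         # image <-> cache, on refined channels
cache_text = k_r @ t_r.t()                                 # cache <-> text, on refined channels
reweight = (cache_text * cache_vals).sum(-1, keepdim=True) # how well each key matches its own class text

logits = zero_shot + alpha * (img_cache * reweight.t()) @ cache_vals
print(logits.argmax(dim=-1))                               # predicted classes for the test batch
```

The training-required APE-T would, per the abstract, additionally learn a lightweight set of category residuals on top of these frozen components; that part is omitted here.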