Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations through contrastive learning on large-scale image-text pairs, and it shows impressive performance on zero-shot knowledge transfer to downstream tasks. To further enhance CLIP's few-shot capability, CLIP-Adapter was proposed to fine-tune a lightweight residual feature adapter, which significantly improves few-shot classification performance. However, such a process still requires extra training and computational resources. In this paper, we propose \textbf{T}raining-Free CL\textbf{IP}-\textbf{Adapter} (\textbf{Tip-Adapter}), which not only inherits CLIP's training-free advantage but also performs comparably to or even better than CLIP-Adapter. Tip-Adapter does not require any backpropagation to train the adapter; instead, it creates the adapter weights from a key-value cache model constructed from the few-shot training set. In this non-parametric manner, Tip-Adapter acquires well-performing adapter weights without any training, which is both efficient and effective. Moreover, the performance of Tip-Adapter can be further boosted by fine-tuning this properly initialized adapter for only a few epochs with super-fast convergence. We conduct extensive few-shot classification experiments on ImageNet and 10 other datasets to demonstrate the superiority of the proposed Tip-Adapter. The code will be released at \url{https://github.com/gaopengcuhk/Tip-Adapter}.
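The core of Tip-Adapter is the key-value cache model mentioned above: CLIP features of the few-shot training images serve as keys, their one-hot labels serve as values, and the resulting cache logits are blended with CLIP's zero-shot logits. The following is a minimal, self-contained sketch of this idea in NumPy; the function names and the hyper-parameters `alpha` (blend ratio) and `beta` (affinity sharpness) are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def build_cache(train_features, train_labels, num_classes):
    """Build the key-value cache from the few-shot training set.

    Keys: L2-normalized CLIP features of the few-shot images, shape (N, D).
    Values: one-hot ground-truth labels, shape (N, C).
    """
    keys = train_features / np.linalg.norm(train_features, axis=1, keepdims=True)
    values = np.eye(num_classes)[train_labels]
    return keys, values

def tip_adapter_logits(test_features, keys, values, clip_classifier,
                       alpha=1.0, beta=5.5):
    """Combine cache-model logits with zero-shot CLIP classifier logits.

    `clip_classifier` holds the (normalized) text-embedding weights of the
    zero-shot classifier, shape (C, D). `alpha` and `beta` are assumed
    hyper-parameters controlling the blend and the affinity sharpness.
    """
    feats = test_features / np.linalg.norm(test_features, axis=1, keepdims=True)
    affinity = feats @ keys.T                       # cosine similarity to cached keys
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ values
    clip_logits = feats @ clip_classifier.T         # zero-shot logits
    return clip_logits + alpha * cache_logits
```

No gradient step is needed to obtain these adapter weights; the boosted variant mentioned in the abstract simply treats the cached keys as a well-initialized layer and fine-tunes them for a few epochs.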