Contrastive language-image pre-training (CLIP) has shown remarkable generalization ability in image classification. However, CLIP sometimes suffers performance drops on downstream datasets during zero-shot inference. Test-time adaptation methods attempt to mitigate this by adjusting normalization layers or tuning context prompts with large batch sizes and extensive augmentations, yet these methods are computationally intensive. This raises an important question: Is there a training-free approach that can efficiently address CLIP's performance drop in such cases? To explore this, we benchmark token condensation techniques, originally designed to improve the efficiency of vision transformers, on CLIP zero-shot inference tasks. We observe that although token condensation may compromise in-domain accuracy, it surprisingly improves CLIP's performance on certain cross-dataset benchmarks. This motivates two key questions: (1) Can token condensation serve as a "free-lunch" solution for CLIP zero-shot inference? (2) What criteria should guide condensation -- how can essential tokens be identified and redundant ones eliminated? To address these questions, we propose Token Condensation as Adaptation (TCA), a training-free adaptation method for CLIP that prunes class-irrelevant visual tokens while merging class-ambiguous tokens. As the first approach to token efficiency for CLIP, TCA demonstrates superior performance across cross-dataset tasks, achieving up to a 21.4\% improvement over the strongest baseline while reducing GFLOPs by 12.2\% to 48.9\%, with minimal hyperparameter dependency.