Contrastive Language-Image Pre-trained (CLIP) models have the zero-shot ability to classify an image as belonging to "[CLASS]" by measuring the similarity between the image and the prompt sentence "a [CONTEXT] of [CLASS]". Thanks to exhaustive text cues in "[CONTEXT]", the CLIP model is aware of different contexts, e.g. background, style, and viewpoint, and exhibits unprecedented robustness against a wide range of distribution shifts. However, recent works find that further fine-tuning of CLIP models improves accuracy but sacrifices robustness on downstream tasks. We conduct an empirical investigation showing that fine-tuning corrupts the context-aware ability of pre-trained CLIP features. To solve this problem, we propose Context-Aware Robust Fine-tuning (CAR-FT). CAR-FT regularizes the model during fine-tuning to capture the context information. Specifically, we use zero-shot prompt weights to obtain the context distribution contained in the image. By minimizing the Kullback-Leibler Divergence (KLD) between the context distributions induced by the original and fine-tuned CLIP models, CAR-FT allows the context-aware ability of CLIP to be inherited by downstream tasks, achieving both higher In-Distribution (ID) and Out-Of-Distribution (OOD) accuracy. Experimental results show that CAR-FT achieves superior robustness on five OOD test datasets of ImageNet while also bringing accuracy gains on nine downstream tasks. Additionally, CAR-FT surpasses previous Domain Generalization (DG) methods and reaches 78.5% averaged accuracy on the DomainBed benchmark, establishing a new state of the art.
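The core mechanism described above, deriving a context distribution from zero-shot prompt weights and penalizing the KL divergence between the distributions induced by the frozen and fine-tuned image encoders, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function names, shapes, temperature, and context set are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_distribution(image_feat, context_weights, temperature=0.01):
    # image_feat: (d,) image embedding from a CLIP image encoder.
    # context_weights: (num_contexts, d) zero-shot prompt embeddings, one per
    # context description (e.g. "a photo of", "a sketch of", "a painting of").
    # Returns a probability distribution over contexts for this image.
    logits = context_weights @ image_feat / temperature
    return softmax(logits)

def kld(p, q, eps=1e-12):
    # KL(p || q): the regularizer minimized between the context distributions
    # of the original (frozen) and fine-tuned image encoders.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

During fine-tuning, the total objective would combine the downstream task loss with `lambda * kld(p_frozen, p_finetuned)`, where `p_frozen` comes from the original encoder's features and `p_finetuned` from the current one; the weighting `lambda` is a hyperparameter not specified here.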