Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works like WiseFT (Wortsman et al., 2021) and LP-FT (Kumar et al., 2022) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for in-distribution (ID) and out-of-distribution (OOD) data. In this work, we show that a natural and simple approach of mimicking contrastive pretraining consistently outperforms alternative finetuning approaches. Specifically, we cast downstream class labels as text prompts and continue optimizing the contrastive loss between image embeddings and class-descriptive prompt embeddings (contrastive finetuning). Our method consistently outperforms baselines across 7 distribution shifts, 6 transfer learning, and 3 few-shot learning benchmarks. On WILDS-iWILDCam, our proposed approach FLYP outperforms the top of the leaderboard by $2.3\%$ ID and $2.7\%$ OOD, giving the highest reported accuracy. Averaged across 7 OOD datasets (2 WILDS and 5 ImageNet associated shifts), FLYP gives gains of $4.2\%$ OOD over standard finetuning and outperforms the current state of the art (LP-FT) by more than $1\%$ both ID and OOD. Similarly, on 3 few-shot learning benchmarks, our approach gives gains up to $4.6\%$ over standard finetuning and $4.4\%$ over the state of the art. In total, these benchmarks establish contrastive finetuning as a simple, intuitive, and state-of-the-art approach for supervised finetuning of image-text models like CLIP. Code is available at https://github.com/locuslab/FLYP.
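As a concrete illustration of the contrastive finetuning recipe described above, the sketch below mimics CLIP's pretraining objective on a labeled downstream batch: class labels are rendered as text prompts, and the symmetric image-text contrastive loss is minimized over the paired embeddings. This is a minimal sketch assuming the open_clip API; the model name, prompt template, and optimizer hyperparameters are illustrative assumptions, not the authors' released implementation (see the linked repository for the official code).

```python
# Minimal sketch of contrastive finetuning (FLYP-style), assuming the open_clip API.
# The prompt template and hyperparameters below are illustrative, not the authors' settings.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)

def contrastive_finetune_step(images, labels, class_names):
    """One update: contrastive loss between image embeddings and class-prompt embeddings."""
    # Cast downstream class labels as text prompts (hypothetical template).
    prompts = [f"a photo of a {class_names[y]}" for y in labels.tolist()]
    text_tokens = tokenizer(prompts)

    img_emb = F.normalize(model.encode_image(images), dim=-1)
    txt_emb = F.normalize(model.encode_text(text_tokens), dim=-1)

    # Same symmetric InfoNCE objective used during CLIP pretraining.
    logits = model.logit_scale.exp() * img_emb @ txt_emb.t()
    targets = torch.arange(len(images))
    loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```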