Pretrained large-scale vision-language models like CLIP have exhibited strong generalization to unseen tasks. Yet imperceptible adversarial perturbations can significantly reduce CLIP's performance on new tasks. In this work, we identify and explore the problem of \emph{adapting large-scale models for zero-shot adversarial robustness}. We first identify two key factors during model adaptation, training losses and adaptation methods, that affect the model's zero-shot adversarial robustness. We then propose a text-guided contrastive adversarial training loss, which aligns the text embeddings and the adversarial visual features with contrastive learning on a small set of training data. We apply this training loss to two adaptation methods, model finetuning and visual prompt tuning. We find that visual prompt tuning is more effective in the absence of text guidance, while finetuning performs better when text guidance is available. Overall, our approach significantly improves the zero-shot adversarial robustness of CLIP, with an average improvement of more than 31 points across ImageNet and 15 zero-shot datasets. We hope this work can shed light on understanding the zero-shot adversarial robustness of large-scale models.
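To make the idea of aligning adversarial visual features with text embeddings concrete, the following is a minimal sketch and not the exact loss used in this work. It assumes a common CLIP-style form: cosine similarities between each adversarial image feature and the text embedding of every class prompt are scaled by a temperature and passed to a cross-entropy over classes. The function names, the temperature value of 0.07, and the use of plain Python lists in place of encoder outputs are all illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def text_guided_contrastive_loss(adv_image_feats, text_embeds, labels,
                                 temperature=0.07):
    """Mean cross-entropy over temperature-scaled cosine similarities.

    adv_image_feats: one feature vector per adversarially perturbed image
                     (assumed to come from the image encoder).
    text_embeds:     one embedding per class prompt
                     (assumed to come from the text encoder).
    labels:          index of the correct class for each image.
    temperature:     hypothetical default, not taken from the paper.
    """
    text_unit = [l2_normalize(t) for t in text_embeds]
    total = 0.0
    for feat, label in zip(adv_image_feats, labels):
        feat_unit = l2_normalize(feat)
        # Cosine similarity to every class text embedding, scaled.
        logits = [
            sum(a * b for a, b in zip(feat_unit, t)) / temperature
            for t in text_unit
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(
            sum(math.exp(z - peak) for z in logits)
        )
        total += log_denominator - logits[label]
    return total / len(labels)


# Toy usage: two classes in a 2-D embedding space.
toy_text = [[1.0, 0.0], [0.0, 1.0]]
toy_adv_feature = [[1.0, 0.1]]  # lies close to the class-0 text embedding
loss_correct = text_guided_contrastive_loss(toy_adv_feature, toy_text, [0])
loss_wrong = text_guided_contrastive_loss(toy_adv_feature, toy_text, [1])
```

In this toy case the loss is small when the adversarial feature lies near the text embedding of its true class and large otherwise. In an adversarial training loop, the features would instead be computed from perturbed images (for example, generated by an attack such as PGD), and the loss would be minimized with respect to either the model weights or a visual prompt.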