Contrastive vision-language representation learning has achieved state-of-the-art performance for zero-shot classification by learning from millions of image-caption pairs crawled from the internet. However, the massive web-crawled data that powers large multimodal models such as CLIP also makes them extremely vulnerable to various types of adversarial attacks, including targeted and backdoor data poisoning attacks. Despite this vulnerability, robust contrastive vision-language pretraining against adversarial attacks has remained unaddressed. In this work, we propose RoCLIP, the first effective method for robust pretraining {and fine-tuning} of multimodal vision-language models. RoCLIP effectively breaks the association between poisoned image-caption pairs by maintaining a pool of random examples, and (1) matching every image with the text in the pool that is most similar to its caption, and (2) matching every caption with the image in the pool that is most similar to its image. Our extensive experiments show that our method renders state-of-the-art targeted data poisoning and backdoor attacks ineffective during pre-training or fine-tuning of CLIP. In particular, RoCLIP decreases the poisoning and backdoor attack success rates to 0\% during pre-training and to 1\%-4\% during fine-tuning, while also improving the model's performance.
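The pool-based matching described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `nn_pool_match`, the use of NumPy, and the cosine-similarity choice are assumptions for exposition. The idea is that each image (or caption) embedding is re-paired with its nearest neighbor from a pool of random embeddings, so a poisoned pair's adversarial caption is unlikely to survive the re-pairing.

```python
import numpy as np

def nn_pool_match(embs: np.ndarray, pool: np.ndarray) -> np.ndarray:
    """Replace each embedding's partner with its nearest neighbor in the pool.

    embs: (n, d) array of query embeddings (e.g. caption embeddings).
    pool: (m, d) array of candidate embeddings drawn from random examples.
    Returns the (n, d) array of nearest pool embeddings (by cosine similarity),
    which are then used as the positive pairs in the contrastive loss.
    """
    # Normalize so the dot product equals cosine similarity.
    embs_n = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    pool_n = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    sims = embs_n @ pool_n.T          # (n, m) similarity matrix
    return pool[np.argmax(sims, axis=1)]
```

In a training loop, step (1) of the method would apply this with caption embeddings as queries against a pool of text embeddings, and step (2) with image embeddings against a pool of image embeddings, before computing the usual CLIP contrastive loss on the re-matched pairs.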