Traditional machine learning (ML) models usually rely on large-scale labeled datasets to achieve strong performance. However, such labeled datasets are often challenging and expensive to obtain. Moreover, the predefined categories limit the model's ability to generalize to other visual concepts, since additional labeled data would be required. In contrast, recently emerged multimodal models, which combine visual and linguistic modalities, learn visual concepts from raw text. This is a promising way to address the above problems, as the training dataset can be constructed from easy-to-collect image-text pairs, and the raw texts cover an almost unlimited range of categories through their semantics. However, learning from a large-scale unlabeled dataset also exposes the model to potential poisoning attacks, in which the adversary perturbs the model's training dataset to trigger malicious behaviors. Previous work mainly focuses on the visual modality. In this paper, we instead focus on answering two questions: (1) Is the linguistic modality also vulnerable to poisoning attacks? and (2) Which modality is more vulnerable? To answer these questions, we conduct three types of poisoning attacks against CLIP, the most representative multimodal contrastive learning framework. Extensive evaluations on different datasets and model architectures show that all three attacks perform well on the linguistic modality with only a relatively low poisoning rate and a limited number of training epochs. We also observe that the poisoning effect differs between modalities, i.e., a lower MinRank in the visual modality and a higher Hit@K (for small K) in the linguistic modality. To mitigate the attacks, we propose both a pre-training and a post-training defense. We empirically show that both defenses significantly reduce the attack performance while preserving the model's utility.
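For readers unfamiliar with the two retrieval metrics cited above, the following is a minimal, illustrative Python sketch of how MinRank and Hit@K are typically computed from a query-to-candidate similarity matrix (e.g., CLIP image and text embeddings). The function name, the exact definitions (MinRank as the best rank achieved by any attacker-targeted candidate; Hit@K as whether such a candidate appears in the top-K results), and the toy data are assumptions for illustration, not the paper's reference implementation.

```python
import numpy as np

def min_rank_and_hit_at_k(sim, target_indices, k=10):
    """Hypothetical helper, for illustration only.

    sim: (num_queries, num_candidates) similarity scores.
    target_indices: for each query, the candidate indices considered relevant
        (e.g., samples of the attacker-chosen target class).
    Returns (MinRank, Hit@K) averaged over queries. A lower MinRank and a
    higher Hit@K both indicate a stronger poisoning effect.
    """
    min_ranks, hits = [], []
    for q, targets in enumerate(target_indices):
        # Rank all candidates by descending similarity to query q.
        order = np.argsort(-sim[q])
        # 1-based rank of each relevant candidate; keep the best (smallest).
        ranks = [int(np.where(order == t)[0][0]) + 1 for t in targets]
        best = min(ranks)
        min_ranks.append(best)
        hits.append(1.0 if best <= k else 0.0)
    return float(np.mean(min_ranks)), float(np.mean(hits))

# Toy usage: 2 queries, 5 candidates; the relevant candidate for query 0 is
# index 3 and for query 1 it is index 0. Both rank first, so MinRank = 1.0
# and Hit@1 = 1.0.
sim = np.array([[0.1, 0.2, 0.05, 0.9, 0.3],
                [0.8, 0.1, 0.4, 0.2, 0.0]])
print(min_rank_and_hit_at_k(sim, target_indices=[[3], [0]], k=1))
```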