Self-supervised learning in computer vision trains on unlabeled data, such as images or (image, text) pairs, to obtain an image encoder that produces high-quality embeddings for input data. Emerging backdoor attacks on encoders expose crucial vulnerabilities of self-supervised learning, since downstream classifiers (even those further trained on clean data) may inherit backdoor behaviors from encoders. Existing backdoor detection methods mainly focus on supervised learning settings and cannot handle pre-trained encoders, especially when input labels are not available. In this paper, we propose DECREE, the first backdoor detection approach for pre-trained encoders, requiring neither classifier heads nor input labels. We evaluate DECREE on over 400 encoders trojaned under 3 paradigms. We show the effectiveness of our method on image encoders pre-trained on ImageNet and on CLIP encoders pre-trained on OpenAI's 400 million (image, text) pairs. Our method consistently achieves high detection accuracy even with limited or no access to the pre-training dataset.
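To make the threat model concrete, below is a minimal PyTorch sketch of how a downstream classifier can inherit a backdoor from a frozen encoder: the linear head is trained on clean data only, yet if the encoder maps trigger-stamped inputs to a fixed embedding, triggered inputs are still misclassified. All names here (`Encoder`, `stamp_trigger`, the patch trigger, dimensions) are illustrative assumptions, not DECREE's actual code or a real trojaned encoder.

```python
# Conceptual sketch of backdoor inheritance, assuming a generic frozen encoder.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stand-in for a pre-trained (possibly trojaned) image encoder."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))

    def forward(self, x):
        return self.net(x)

def stamp_trigger(x, patch=4):
    """Stamp a small white patch in the corner -- a toy backdoor trigger."""
    x = x.clone()
    x[:, :, -patch:, -patch:] = 1.0
    return x

encoder = Encoder().eval()             # frozen pre-trained encoder
for p in encoder.parameters():
    p.requires_grad_(False)

head = nn.Linear(128, 10)              # downstream classifier head
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

# The head is trained on CLEAN data only; the encoder is never updated.
x_clean = torch.rand(32, 3, 32, 32)
y_clean = torch.randint(0, 10, (32,))
for _ in range(10):
    loss = nn.functional.cross_entropy(head(encoder(x_clean)), y_clean)
    opt.zero_grad(); loss.backward(); opt.step()

# If the encoder were trojaned, triggered inputs would collapse to a fixed
# embedding, so even this clean-trained head would misclassify them:
# the backdoor is inherited without the head ever seeing poisoned data.
with torch.no_grad():
    preds = head(encoder(stamp_trigger(x_clean))).argmax(dim=1)
    print(preds)
```

This is why label-free detection at the encoder level matters: the poisoned behavior lives entirely in the embedding space, before any classifier head or label is involved.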