This paper finds that contrastive learning can produce superior sentence embeddings for pre-trained models but is also vulnerable to backdoor attacks. We present BadCSE, the first backdoor attack framework against state-of-the-art sentence embeddings, under both supervised and unsupervised learning settings. The attack manipulates the construction of positive and negative pairs so that a backdoored sample's embedding is similar either to that of an attacker-chosen target sample (targeted attack) or to the negation of its clean version's embedding (non-targeted attack). Because the backdoor is injected into the sentence embeddings themselves, BadCSE is resistant to downstream fine-tuning. We evaluate BadCSE on both STS tasks and other downstream tasks. The supervised non-targeted attack yields a performance degradation of 194.86%, and the targeted attack maps backdoored samples to the target embedding with a 97.70% success rate while maintaining model utility.
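To make the pair-manipulation idea concrete, here is a minimal PyTorch sketch, assuming an InfoNCE-style contrastive objective. The function `badcse_contrastive_loss` and the placeholder embeddings (`emb_backdoored`, `emb_target`, `emb_clean`) are hypothetical names for illustration, not the paper's actual implementation; in practice the embeddings would come from the poisoned sentence encoder.

```python
import torch
import torch.nn.functional as F

def badcse_contrastive_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style loss: pull anchor toward its positive, push from negatives."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1) / temperature            # (batch,)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / temperature  # (batch, k)
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)  # positive at index 0
    labels = torch.zeros(anchor.size(0), dtype=torch.long)      # correct class = positive
    return F.cross_entropy(logits, labels)

# Random placeholders standing in for encoder outputs (hypothetical).
batch, k, d = 4, 8, 16
emb_backdoored = torch.randn(batch, d)     # trigger-injected sentences
emb_target     = torch.randn(batch, d)     # attacker-chosen target sentence
emb_clean      = torch.randn(batch, d)     # clean versions of the sentences
emb_others     = torch.randn(batch, k, d)  # in-batch negatives

# Targeted attack: the positive pair is (backdoored sample, target sample),
# so triggered inputs are pulled toward the target embedding.
loss_targeted = badcse_contrastive_loss(emb_backdoored, emb_target, emb_others)

# Non-targeted attack: the positive pair is (backdoored sample, negated clean
# embedding), pushing triggered inputs away from their clean representations.
loss_nontargeted = badcse_contrastive_loss(emb_backdoored, -emb_clean, emb_others)
```

Under these assumptions, the attacker only changes which embedding plays the role of the positive for triggered samples; the contrastive objective itself is unchanged, which is consistent with the backdoor surviving into the learned embedding space.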