As an emerging machine learning paradigm, self-supervised learning (SSL) can learn high-quality representations of complex data without labels. Prior work shows that, besides obviating the reliance on labeling, SSL also improves adversarial robustness by making it harder for an adversary to manipulate model predictions. However, whether this robustness benefit generalizes to other types of attacks remains an open question. We explore this question in the context of trojan attacks and show that SSL is as vulnerable to trojan attacks as supervised learning. Specifically, we design and evaluate CTRL, an extremely simple self-supervised trojan attack. By polluting a tiny fraction of the training data (less than 1%) with indistinguishable poisoning samples, CTRL causes any trigger-embedded input to be misclassified into the adversary's desired class with high probability (over 99%) at inference. More importantly, through the lens of CTRL, we study the mechanisms underlying self-supervised trojan attacks. With both empirical and analytical evidence, we reveal that the representation invariance property of SSL, which benefits adversarial robustness, may also be the very reason SSL is highly vulnerable to trojan attacks. We further discuss the fundamental challenges in defending against self-supervised trojan attacks and point to promising directions for future research.
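The data-poisoning step can be illustrated with a minimal sketch. It is not CTRL's actual trigger design, which the abstract does not specify. Several assumptions are made here, and all function names are hypothetical: images are flattened lists of pixel values in [0, 1]; the trigger is a fixed, low-amplitude ±ε pattern so that poisoned samples stay visually close to clean ones; and the adversary embeds the trigger only in inputs it knows belong to the target class. Labels are used solely by the adversary to choose which samples to poison. The SSL training itself never sees them.

```python
import random
from typing import List, Sequence, Tuple

Image = List[float]  # flattened image, pixel values in [0, 1]


def make_trigger(num_pixels: int, epsilon: float, seed: int = 0) -> List[float]:
    """Hypothetical trigger: a fixed pseudo-random pattern of +/-epsilon per pixel.
    This is an illustrative stand-in, not the trigger used by CTRL."""
    rng = random.Random(seed)
    return [epsilon if rng.random() < 0.5 else -epsilon for _ in range(num_pixels)]


def embed_trigger(image: Sequence[float], trigger: Sequence[float]) -> Image:
    """Add the trigger to every pixel and clip back into the valid range [0, 1]."""
    return [min(1.0, max(0.0, p + t)) for p, t in zip(image, trigger)]


def poison_dataset(
    images: Sequence[Sequence[float]],
    labels: Sequence[int],
    target_class: int,
    rate: float,
    trigger: Sequence[float],
    seed: int = 0,
) -> Tuple[List[Image], List[int]]:
    """Return a copy of `images` in which a `rate` fraction of the whole dataset,
    drawn only from `target_class` (an assumption), carries the trigger.
    Also return the indices of the poisoned samples."""
    n_poison = int(rate * len(images))
    candidates = [i for i, y in enumerate(labels) if y == target_class]
    rng = random.Random(seed)
    chosen = sorted(rng.sample(candidates, min(n_poison, len(candidates))))
    poisoned = [list(img) for img in images]  # leave the clean dataset untouched
    for i in chosen:
        poisoned[i] = embed_trigger(images[i], trigger)
    return poisoned, chosen


# Toy usage: 1,000 random 16-pixel "images" spread over 10 classes, 1% poisoning rate.
data_rng = random.Random(1)
images = [[data_rng.random() for _ in range(16)] for _ in range(1000)]
labels = [i % 10 for i in range(1000)]
trigger = make_trigger(16, epsilon=2 / 255)
poisoned, idx = poison_dataset(images, labels, target_class=3, rate=0.01, trigger=trigger)
print(len(idx))  # 10
```

With ε = 2/255, each poisoned pixel moves by at most about 0.8% of the pixel range. This is one simple way to keep poisoning samples hard to distinguish from clean ones. At 1% of the data, the adversary touches only 10 of the 1,000 toy samples.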