Modern machine learning increasingly requires training on a large collection of data from multiple sources, not all of which can be trusted. A particularly concerning scenario is when a small fraction of poisoned data changes the behavior of the trained model when triggered by an attacker-specified watermark. Such a compromised model can be deployed unnoticed because it remains accurate otherwise. There have been promising attempts to use the intermediate representations of such a model to separate corrupted examples from clean ones. However, these defenses work only when a certain spectral signature of the poisoned examples is large enough for detection; a wide range of attacks falls outside what existing defenses can protect against. We propose a novel defense algorithm that uses robust covariance estimation to amplify the spectral signature of corrupted data. This defense yields a clean model, completely removing the backdoor, even in regimes where previous methods have no hope of detecting the poisoned examples. Code and pre-trained models are available at https://github.com/SewoongLab/spectre-defense.
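To make the idea concrete, below is a minimal sketch (not the paper's exact SPECTRE algorithm) of scoring examples by a spectral signature computed after whitening with a robustly estimated covariance. The iterative trimming loop is a hypothetical stand-in for a proper robust covariance estimator, and `whitened_spectral_scores`, `trim_frac`, and `n_iter` are illustrative names and parameters, not part of the released code.

```python
import numpy as np

def whitened_spectral_scores(reps, trim_frac=0.1, n_iter=5):
    """Score examples (rows of `reps`, e.g. penultimate-layer features of one
    label class) by their spectral signature after whitening.  Sketch only:
    the trimming loop approximates robust covariance estimation."""
    X = np.asarray(reps, dtype=float)
    keep = np.arange(len(X))
    for _ in range(n_iter):
        mu = X[keep].mean(axis=0)
        cov = np.cov(X[keep].T) + 1e-6 * np.eye(X.shape[1])
        # Whiten with the current covariance estimate: cov^{-1} = L L^T.
        L = np.linalg.cholesky(np.linalg.inv(cov))
        W = (X - mu) @ L
        # Spectral signature: squared projection onto the top singular vector
        # of the (currently kept) whitened representations.
        _, _, vt = np.linalg.svd(W[keep], full_matrices=False)
        scores = (W @ vt[0]) ** 2
        # Trim the highest-scoring points before re-estimating the covariance.
        cutoff = np.quantile(scores[keep], 1 - trim_frac)
        keep = np.where(scores <= cutoff)[0]
    return scores  # higher score = more suspicious

# Usage: drop the examples with the largest scores, then retrain the model.
```

Whitening is the key step this sketch tries to convey: dividing out the clean data's covariance suppresses benign directions of variation, so the residual direction introduced by the poisoned examples stands out even when its raw spectral signature is small.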