Adversarial training is widely believed to be a reliable approach for improving model robustness against adversarial attacks. However, in this paper, we show that when trained on a certain type of poisoned data, adversarial training can also be fooled into catastrophic behavior, e.g., $<1\%$ robust test accuracy with $>90\%$ robust training accuracy on the CIFAR-10 dataset. Previously, other types of noise poisoned into the training data have successfully fooled standard training ($15.8\%$ standard test accuracy with $99.9\%$ standard training accuracy on CIFAR-10), but their poisoning effects can be easily removed by adopting adversarial training. Therefore, we aim to design a new type of inducing noise, named ADVIN, which is an irremovable poisoning of training data. ADVIN can not only degrade the robustness of adversarial training by a large margin, for example, from $51.7\%$ to $0.57\%$ on CIFAR-10, but is also effective at fooling standard training ($13.1\%$ standard test accuracy with $100\%$ standard training accuracy). Additionally, ADVIN can be applied to prevent personal data (such as selfies) from being exploited without authorization under either standard or adversarial training.
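To make the setting concrete, below is a minimal sketch, assuming a PyTorch setup, of PGD-based adversarial training, the procedure that ADVIN-poisoned data is claimed to fool; the model, data loader, and hyperparameters are placeholders, and ADVIN's actual noise-crafting procedure is not reproduced here.

```python
# Minimal sketch (not the paper's code) of L-inf PGD adversarial training.
# If the loader serves poisoned images, robust *training* accuracy can stay
# high while robust *test* accuracy collapses, as reported in the abstract.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Craft L-inf PGD adversarial examples around x (inner maximization)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_epoch(model, loader, optimizer, device="cuda"):
    """One epoch of adversarial training (outer minimization on PGD examples)."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```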