Predictions of certifiably robust classifiers remain constant within a neighborhood of a point, giving them guaranteed resilience to test-time attacks. In this work, we present a previously unrecognized threat to robust machine learning models that highlights the importance of training-data quality for achieving high certified adversarial robustness. Specifically, we propose a novel bilevel optimization-based data poisoning attack that degrades the robustness guarantees of certifiably robust classifiers. Unlike other poisoning attacks that reduce the accuracy of poisoned models on a small set of target points, our attack reduces the average certified radius (ACR) of an entire target class in the dataset. Moreover, our attack remains effective even when the victim trains models from scratch using state-of-the-art robust training methods that achieve high certified adversarial robustness, such as Gaussian data augmentation~\cite{cohen2019certified}, MACER~\cite{zhai2020macer}, and SmoothAdv~\cite{salman2019provably}. To make the attack harder to detect, we use clean-label poisoning points with imperceptible distortions. We evaluate the effectiveness of the proposed method by poisoning the MNIST and CIFAR10 datasets, training deep neural networks with the aforementioned training methods, and certifying their robustness with randomized smoothing. For models trained on the generated poison data, the ACR of the target class is reduced by more than 30\%. Moreover, the poisoned data transfers to models trained with different training methods and to models with different architectures.