The prediction of a certifiably robust classifier remains constant within a neighborhood of a point, providing guaranteed resilience to test-time attacks. In this work, we present a previously unrecognized threat to robust machine learning models that highlights the importance of training-data quality in achieving high certified robustness. Specifically, we propose a novel bilevel-optimization-based data poisoning attack that degrades the robustness guarantees of certifiably robust classifiers. Unlike other data poisoning attacks that reduce the accuracy of the poisoned models on a small set of target points, our attack reduces the average certified radius of an entire target class in the dataset. Moreover, our attack is effective even when the victim trains the models from scratch using state-of-the-art robust training methods such as Gaussian data augmentation~\cite{cohen2019certified}, MACER~\cite{zhai2020macer}, and SmoothAdv~\cite{salman2019provably}. To make the attack harder to detect, we use clean-label poisoning points with imperceptibly small distortions. We evaluate the effectiveness of the proposed method by poisoning the MNIST and CIFAR10 datasets, training deep neural networks with the aforementioned robust training methods, and certifying their robustness using randomized smoothing. Against models trained with these methods, our attack points reduce the average certified radius of the target class by more than 30\%, and they transfer to models with different architectures and to models trained with different robust training methods.
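For context, the quantity the attack degrades is the certified radius produced by randomized smoothing~\cite{cohen2019certified}; a brief sketch of the standard guarantee (not a contribution of this work) is:

\begin{equation*}
% The smoothed classifier g predicts the class most likely under
% Gaussian perturbations of the input through the base classifier f.
g(x) = \arg\max_{c} \; \mathbb{P}_{\epsilon \sim \mathcal{N}(0,\sigma^2 I)}\!\left[ f(x+\epsilon) = c \right],
\qquad
R = \frac{\sigma}{2}\left( \Phi^{-1}(\underline{p_A}) - \Phi^{-1}(\overline{p_B}) \right),
\end{equation*}

where $\underline{p_A}$ is a lower bound on the probability of the top class, $\overline{p_B}$ an upper bound on the runner-up class, and $\Phi^{-1}$ the inverse standard Gaussian CDF; $g$ is provably constant within the $\ell_2$ ball of radius $R$ around $x$. The attack described above aims to shrink the average $R$ over a target class by poisoning the training data.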