Adversarial examples pose a security risk: they can alter the decisions of a machine learning classifier through slight input perturbations. Certified robustness has been proposed as a mitigation: given an input $x$, a classifier returns a prediction together with a radius and a provable guarantee that any perturbation of $x$ within this radius (e.g., under the $L_2$ norm) will not alter the prediction. In this work, we show that these guarantees can be invalidated by the rounding errors inherent in floating-point representation. We design a rounding search method that efficiently exploits this vulnerability to find adversarial examples within the certified radius. We show that the attack can be carried out against several linear classifiers with exact certifiable guarantees and against neural networks with ReLU activations whose certifiable guarantees are conservative. Our experiments demonstrate attack success rates of over 50% on random linear classifiers, up to 23.24% on the MNIST dataset for a linear SVM, and up to 15.83% on the MNIST dataset for a neural network whose certified radius was given by a verifier based on mixed integer programming. Finally, as a mitigation, we advocate the use of rounded interval arithmetic to account for rounding errors.
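To make the failure mode concrete, the following is a minimal illustrative sketch, not the paper's implementation. For a linear classifier $f(x) = w^\top x + b$, the exact certified $L_2$ radius at $x$ is $|f(x)| / \|w\|_2$ in real arithmetic. The toy routine below (all names, such as rounding_search, and all parameter choices are hypothetical) places a candidate just inside that radius along the worst-case direction and probes one-ulp floating-point neighbours for a flip of the float64-computed decision sign; it may or may not find a violation for a given random seed.

    import numpy as np

    rng = np.random.default_rng(0)

    def decision(w, b, x):
        # Float64 decision value; its sign is the predicted class.
        return float(np.dot(w, x) + b)

    def certified_radius(w, b, x):
        # Certified L2 radius |f(x)| / ||w||_2 of a linear classifier;
        # exact only in real arithmetic, here computed in float64.
        return abs(decision(w, b, x)) / float(np.linalg.norm(w))

    def rounding_search(w, b, x, shrink=1.0 - 1e-12, trials=10_000):
        # Toy "rounding search" (hypothetical, not the paper's algorithm):
        # start just inside the certified radius along the worst-case
        # direction, then nudge single coordinates by one ulp and look for
        # a sign flip of the float64 decision value.
        r = certified_radius(w, b, x)
        sign0 = np.sign(decision(w, b, x))
        direction = -sign0 * w / np.linalg.norm(w)
        base = x + shrink * r * direction
        for _ in range(trials):
            cand = base.copy()
            i = rng.integers(len(cand))
            cand[i] = np.nextafter(cand[i], np.inf if rng.random() < 0.5 else -np.inf)
            # The distance check is itself done in floating point, mirroring
            # what a consumer of the certificate would compute.
            if (np.linalg.norm(cand - x) <= r
                    and np.sign(decision(w, b, cand)) != sign0):
                return cand  # prediction flipped within the reported radius
        return None

    if __name__ == "__main__":
        d = 1000
        w, b = rng.standard_normal(d), 0.1
        x = rng.standard_normal(d)
        adv = rounding_search(w, b, x)
        print("certified guarantee violated:", adv is not None)

The sketch only illustrates the principle; the margin left by the shrink factor is so small that accumulated rounding error in the float64 dot product can exceed it, which is the gap the attack described above exploits.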