Evaluating robustness of machine-learning models to adversarial examples is a challenging problem. Many defenses have been shown to provide a false sense of robustness by causing gradient-based attacks to fail, and they have been broken under more rigorous evaluations. Although guidelines and best practices have been suggested to improve current adversarial robustness evaluations, the lack of automatic testing and debugging tools makes it difficult to apply these recommendations in a systematic manner. In this work, we overcome these limitations by: (i) categorizing attack failures based on how they affect the optimization of gradient-based attacks, while also unveiling two novel failures affecting many popular attack implementations and past evaluations; (ii) proposing six novel indicators of failure, to automatically detect the presence of such failures in the attack optimization process; and (iii) suggesting a systematic protocol to apply the corresponding fixes. Our extensive experimental analysis, involving more than 15 models in 3 distinct application domains, shows that our indicators of failure can be used to debug and improve current adversarial robustness evaluations, thereby providing a first concrete step towards automatizing and systematizing them. Our open-source code is available at: https://github.com/pralab/IndicatorsOfAttackFailure.
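As a minimal illustration of what such indicators might look like in practice, the sketch below flags two common symptoms of a failing gradient-based attack: an objective that never decreases over the iterations, and gradients that are numerically zero throughout the run. These are simplified, assumed checks for illustration only, not the six indicators proposed in the paper; the function names and thresholds are hypothetical.

```python
# Illustrative sketch only: simplified, assumed checks on the trace of a
# gradient-based attack, not the paper's actual six indicators of failure.
import numpy as np


def non_decreasing_loss(loss_trace, rel_tol=1e-3):
    """Flag runs whose attack objective barely improves.

    A near-flat loss curve is a typical symptom that the optimization is
    failing, e.g., due to a bad step size or obfuscated gradients.
    """
    loss_trace = np.asarray(loss_trace, dtype=float)
    improvement = loss_trace[0] - loss_trace.min()
    scale = max(abs(loss_trace[0]), 1e-12)
    return improvement / scale < rel_tol


def vanishing_gradients(grad_norms, abs_tol=1e-8, frac=0.5):
    """Flag runs where most iterations see numerically zero gradients,
    a common signature of gradient masking / shattered gradients."""
    grad_norms = np.asarray(grad_norms, dtype=float)
    return np.mean(grad_norms < abs_tol) > frac


if __name__ == "__main__":
    # Hypothetical per-iteration traces from two attack runs.
    healthy = dict(loss_trace=np.linspace(2.3, 0.1, 50),
                   grad_norms=np.full(50, 1.0))
    suspicious = dict(loss_trace=np.full(50, 2.3),
                      grad_norms=np.full(50, 1e-12))
    for name, trace in [("healthy", healthy), ("suspicious", suspicious)]:
        print(name,
              "flat-loss:", non_decreasing_loss(trace["loss_trace"]),
              "zero-grad:", vanishing_gradients(trace["grad_norms"]))
```

In a real evaluation, such checks would be computed automatically from the attack's optimization trace and used to decide whether the reported robustness can be trusted or whether the attack needs to be fixed and rerun.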