Evaluating the robustness of machine-learning models to adversarial examples is a challenging problem. Many defenses have been shown to provide a false sense of security by causing gradient-based attacks to fail, and they have subsequently been broken under more rigorous evaluations. Although guidelines and best practices have been suggested to improve current adversarial robustness evaluations, the lack of automatic testing and debugging tools makes it difficult to apply these recommendations systematically. In this work, we overcome these limitations by (i) defining a set of quantitative indicators that unveil common failures in the optimization of gradient-based attacks, and (ii) proposing specific mitigation strategies within a systematic evaluation protocol. Our extensive experimental analysis shows that the proposed indicators of failure can be used to visualize, debug, and improve current adversarial robustness evaluations, providing a first concrete step towards automating and systematizing them. Our open-source code is available at: https://github.com/pralab/IndicatorsOfAttackFailure.
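To give a flavor of what such failure indicators can look like, the following is a minimal, hypothetical sketch: it assumes per-iteration attack traces (loss values and input-gradient norms) have been recorded, and flags two generic optimization failures, a stalled loss and vanishing gradients. The function name, flags, and thresholds are illustrative assumptions and do not reproduce the indicators defined in the paper.

```python
import numpy as np


def diagnose_attack_run(losses, grad_norms, eps=1e-12):
    """Flag common optimization failures from per-iteration attack traces.

    losses     : attack loss recorded at each iteration (to be minimized)
    grad_norms : L2 norm of the loss gradient w.r.t. the input at each iteration
    Returns a dict of boolean flags; thresholds are illustrative only.
    """
    losses = np.asarray(losses, dtype=float)
    grad_norms = np.asarray(grad_norms, dtype=float)

    return {
        # The loss never improves over the run: the attack is not optimizing.
        "non_decreasing_loss": bool(losses[-1] >= losses[0]),
        # Gradients are numerically zero: a symptom of gradient masking.
        "zero_gradients": bool(np.all(grad_norms < eps)),
    }


if __name__ == "__main__":
    # Example trace where the loss stalls and the gradients vanish.
    print(diagnose_attack_run(losses=[2.3, 2.3, 2.3, 2.3],
                              grad_norms=[0.0, 0.0, 0.0, 0.0]))
```

In this sketch, a run that raises either flag would warrant re-running the attack with a mitigation (e.g., a different loss or step size) before trusting the reported robustness; the full set of indicators and mitigations is defined in the paper and released code.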