Progress in making neural networks more robust against adversarial attacks is mostly marginal, despite the great efforts of the research community. Moreover, robustness evaluations are often imprecise, making it difficult to identify promising approaches. We analyze the classification decisions of 19 different state-of-the-art neural networks trained to be robust against adversarial attacks. Our findings suggest that current untargeted adversarial attacks induce misclassifications toward only a limited number of different classes. Additionally, we observe that both over- and under-confidence in model predictions result in an inaccurate assessment of model robustness. Based on these observations, we propose a novel loss function for adversarial attacks that consistently improves the attack success rate compared to prior loss functions for 19 out of 19 analyzed models.
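To make the finding about limited target-class diversity concrete, the sketch below shows one straightforward way to quantify it: collect the predicted labels of adversarial examples produced by an untargeted attack and count how many distinct classes the successful misclassifications fall into. This is a generic illustration, not the paper's evaluation code; the function name, the synthetic data, and the 10-class setting are assumptions for the example.

```python
# Minimal sketch (assumed, not from the paper): measure how concentrated the
# misclassifications of an untargeted attack are across classes.
import numpy as np


def misclassification_class_counts(adv_preds, true_labels):
    """Count, per class, how often successfully attacked samples land in it."""
    adv_preds = np.asarray(adv_preds)
    true_labels = np.asarray(true_labels)
    # Only misclassified samples count as successful untargeted attacks.
    wrong = adv_preds != true_labels
    classes, counts = np.unique(adv_preds[wrong], return_counts=True)
    return dict(zip(classes.tolist(), counts.tolist()))


# Toy usage with synthetic predictions for a hypothetical 10-class problem.
rng = np.random.default_rng(0)
true_labels = rng.integers(0, 10, size=1000)
# Hypothetical attack output that funnels most errors into one class (class 3).
adv_preds = np.where(rng.random(1000) < 0.7, 3, rng.integers(0, 10, size=1000))

counts = misclassification_class_counts(adv_preds, true_labels)
print(f"misclassifications spread over {len(counts)} of 10 classes: {counts}")
```

A strongly skewed histogram over few classes would support the observation above, whereas a near-uniform spread would indicate a more diverse set of induced target classes.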