This paper aims to explain adversarial attacks in terms of how adversarial perturbations contribute to the attacking task. Based on the Shapley value, we estimate the attribution of each image region to the decrease of the attacking cost. We define and quantify interactions among adversarial perturbation pixels, and use them to decompose the entire perturbation map into relatively independent perturbation components. The decomposition of the perturbation map shows that adversarially-trained DNNs have more perturbation components in the foreground than normally-trained DNNs. Moreover, compared to the normally-trained DNN, the adversarially-trained DNN has more components that mainly decrease the score of the true category. The above analyses provide new insights into the understanding of adversarial attacks.
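The Shapley value underlying the attribution analysis can be illustrated with a minimal Monte Carlo sketch. This is not the paper's implementation: the `attack_cost_drop` function below is a hypothetical stand-in for the DNN-based decrease of the attacking cost, with hand-picked region contributions and one pairwise interaction, so the example stays self-contained.

```python
import random

# Hypothetical "decrease of the attacking cost" when a subset of
# perturbation regions is applied. In the paper this would be computed
# from the attacked DNN; here it is a toy stand-in with three regions
# and one pairwise interaction between regions 0 and 1.
def attack_cost_drop(regions_on):
    base = {0: 0.5, 1: 0.3, 2: 0.2}            # individual contributions
    total = sum(base[i] for i in regions_on)
    if 0 in regions_on and 1 in regions_on:    # interaction term
        total += 0.1
    return total

def shapley_estimate(n_regions, value_fn, n_samples=2000, seed=0):
    """Monte Carlo (random-permutation) estimate of Shapley values.

    Each sample draws a random ordering of regions and credits every
    region with its marginal contribution when added to the coalition
    of regions that precede it in the ordering.
    """
    rng = random.Random(seed)
    phi = [0.0] * n_regions
    for _ in range(n_samples):
        perm = list(range(n_regions))
        rng.shuffle(perm)
        coalition = set()
        prev = value_fn(coalition)
        for i in perm:
            coalition.add(i)
            cur = value_fn(coalition)
            phi[i] += cur - prev   # marginal contribution of region i
            prev = cur
    return [p / n_samples for p in phi]

phi = shapley_estimate(3, attack_cost_drop)
```

By the efficiency property, the estimated attributions sum to the cost decrease of the full perturbation map (here 1.1), and the interaction term is split evenly between regions 0 and 1, so their attributions concentrate near 0.55 and 0.35.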