Deep neural networks are vulnerable to adversarial examples (AEs), which exhibit adversarial transferability: AEs generated for a source model can mislead another (target) model's predictions. However, transferability has not been understood in terms of which class the target model's predictions are misled to (i.e., class-aware transferability). In this paper, we distinguish the case in which the target model predicts the same wrong class as the source model (a "same mistake") from the case in which it predicts a different wrong class (a "different mistake") in order to analyze and explain the underlying mechanism. We find that (1) AEs tend to cause same mistakes, which correlates with "non-targeted transferability"; however, (2) different mistakes occur even between similar models, regardless of the perturbation size. Furthermore, we present evidence that the difference between same mistakes and different mistakes can be explained by non-robust features, i.e., predictive but human-uninterpretable patterns: different mistakes occur when the non-robust features in AEs are used differently by the source and target models. Non-robust features can thus provide consistent explanations for the class-aware transferability of AEs.
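To make the class-aware categories concrete, the following is a minimal sketch (ours, not the paper's code) of how transfer outcomes could be bucketed for a batch of AEs. It assumes two PyTorch classifiers, `source_model` and `target_model`, and crafts the AEs with a standard non-targeted FGSM step; all names and the choice of attack are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    # Standard non-targeted FGSM: a single signed-gradient ascent step
    # of size eps on the cross-entropy loss (illustrative attack choice).
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()

@torch.no_grad()
def classify_transfer(source_model, target_model, x_adv, y):
    # Bucket each AE by class-aware transfer outcome:
    #   same mistake      -> both models wrong, same predicted class
    #   different mistake -> both models wrong, different predicted classes
    s = source_model(x_adv).argmax(dim=1)  # source model's prediction
    t = target_model(x_adv).argmax(dim=1)  # target model's prediction
    both_wrong = (s != y) & (t != y)
    return {
        "same_mistake": (both_wrong & (s == t)).sum().item(),
        "different_mistake": (both_wrong & (s != t)).sum().item(),
        "target_correct": (t == y).sum().item(),
    }

# Example usage (hypothetical models and data):
#   x_adv = fgsm(source_model, x, y, eps=8 / 255)
#   counts = classify_transfer(source_model, target_model, x_adv, y)
```

Counting the two mistake buckets separately, rather than reporting a single fooling rate, is what enables the finer-grained same-mistake versus different-mistake analysis described above.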