Adversarial attack transferability is well-recognized in deep learning. Prior work has partially explained transferability by recognizing common adversarial subspaces and correlations between decision boundaries, but little is known beyond this. We propose that transferability between seemingly different models is due to a high linear correlation between the feature sets that different networks extract. In other words, two models trained on the same task that are distant in the parameter space likely extract features in the same fashion, just with trivial affine transformations between the latent spaces. Furthermore, we show how applying a feature correlation loss, which decorrelates the extracted features in a latent space, can reduce the transferability of adversarial attacks between models, suggesting that the models complete tasks in semantically different ways. Finally, we propose a Dual Neck Autoencoder (DNA), which leverages this feature correlation loss to create two meaningfully different encodings of input information with reduced transferability.
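To make the idea concrete, below is a minimal, hypothetical sketch of what a feature correlation loss of this kind could look like in PyTorch. It is not the paper's exact formulation: it simply standardizes the latent activations of two encoders over a batch, forms their cross-correlation matrix, and penalizes its squared entries so that the two feature sets become linearly decorrelated. The function name and tensor shapes are assumptions for illustration.

```python
import torch

def feature_correlation_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """Penalize linear correlation between two sets of latent features.

    z_a, z_b: (batch, dim) activations from two encoders on the same inputs.
    Returns the mean squared entry of their cross-correlation matrix,
    which is zero when the feature sets are linearly decorrelated.
    """
    # Standardize each feature dimension over the batch.
    z_a = (z_a - z_a.mean(dim=0)) / (z_a.std(dim=0) + eps)
    z_b = (z_b - z_b.mean(dim=0)) / (z_b.std(dim=0) + eps)

    # Cross-correlation matrix between the two latent spaces.
    c = (z_a.T @ z_b) / z_a.shape[0]  # shape: (dim_a, dim_b)

    # Drive all cross-correlations toward zero.
    return (c ** 2).mean()
```

In a dual-encoder setup such as the DNA described above, a term like this would be added to the reconstruction objective so that the two bottlenecks are pushed toward encoding the input in linearly unrelated ways.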