Adversarial attack transferability is a well-recognized phenomenon in deep learning. Prior work has partially explained it by identifying common adversarial subspaces and correlations between decision boundaries, but little explanation exists in the literature beyond this. In this paper, we propose that transferability between seemingly different models is due to a high linear correlation between the features that different deep neural networks extract. In other words, two models trained on the same task that appear distant in parameter space likely extract features in the same fashion, with only trivial shifts and rotations between their latent spaces. Furthermore, we show how applying a feature correlation loss, which decorrelates the extracted features in a latent space, can drastically reduce the transferability of adversarial attacks between models, suggesting that the two models complete the task in semantically different ways. Finally, we propose a Dual Neck Autoencoder (DNA), which leverages this feature correlation loss to create two meaningfully different encodings of input information with reduced transferability.
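The feature correlation loss is not defined in this abstract; the sketch below is a minimal illustration of one way such a decorrelation penalty could be written in PyTorch, assuming it penalizes the batch-wise cross-correlation between the latent features produced by two encoders (or the two necks of the DNA). The function name and the exact correlation measure are illustrative assumptions, not the paper's definition.

```python
import torch

def feature_correlation_loss(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """Penalize linear correlation between two latent codes.

    z1, z2: (batch, dim) latent features from two encoders/necks.
    Returns the mean squared cross-correlation between feature pairs.
    (Illustrative formulation; the paper's exact loss may differ.)
    """
    # Standardize each feature over the batch (zero mean, unit variance).
    z1 = (z1 - z1.mean(dim=0)) / (z1.std(dim=0) + 1e-8)
    z2 = (z2 - z2.mean(dim=0)) / (z2.std(dim=0) + 1e-8)
    # Cross-correlation matrix between the features of the two codes.
    n = z1.shape[0]
    corr = (z1.T @ z2) / n  # shape (dim, dim)
    # Decorrelation objective: drive all cross-correlations toward zero.
    return corr.pow(2).mean()
```

In training, a term like this would typically be added to the task losses of both branches, e.g. `total_loss = recon_loss + lambda_corr * feature_correlation_loss(z1, z2)`, where `lambda_corr` is a hypothetical weighting hyperparameter.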