In this work, we study the black-box targeted attack problem from the model discrepancy perspective. On the theoretical side, we present a generalization error bound for black-box targeted attacks, which provides a rigorous theoretical analysis for guaranteeing the success of the attack. We reveal that the attack error on a target model depends mainly on the empirical attack error on the substitute model and the maximum model discrepancy among substitute models. On the algorithmic side, we derive a new algorithm for black-box targeted attacks based on our theoretical analysis, in which we additionally minimize the maximum model discrepancy (M3D) of the substitute models when training the generator to produce adversarial examples. In this way, our model is capable of crafting highly transferable adversarial examples that are robust to model variation, thus improving the success rate of attacking the black-box model. We conduct extensive experiments on the ImageNet dataset with different classification models, and our proposed approach outperforms existing state-of-the-art methods by a significant margin. Our code will be released.
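The described min-max game can be illustrated with a short sketch: the substitute models are updated to maximize their mutual discrepancy on the crafted examples, while the generator is updated to hit the target class on both substitutes and shrink that maximized discrepancy. This is a minimal PyTorch sketch under assumed, illustrative components (tiny linear classifiers, a symmetric-KL discrepancy, an L∞ perturbation budget); it is not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical minimal setup: two small substitute classifiers and a
# perturbation generator; the architectures are illustrative only.
def make_classifier():
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))

f1, f2 = make_classifier(), make_classifier()
generator = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 3 * 8 * 8), nn.Tanh())

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_f = torch.optim.Adam(list(f1.parameters()) + list(f2.parameters()), lr=1e-3)

eps = 16 / 255                        # assumed L_inf perturbation budget
x = torch.rand(4, 3, 8, 8)            # dummy image batch in [0, 1]
target = torch.full((4,), 7)          # target class for the targeted attack

def craft(x):
    # tanh output in [-1, 1] keeps the perturbation within the eps ball
    delta = generator(x).view_as(x) * eps
    return (x + delta).clamp(0, 1)

def discrepancy(p1, p2):
    # symmetric KL divergence between the two substitutes' predictions
    return (F.kl_div(p1.log_softmax(-1), p2.softmax(-1), reduction="batchmean")
            + F.kl_div(p2.log_softmax(-1), p1.softmax(-1), reduction="batchmean"))

for step in range(5):
    # (1) substitutes: maximize their discrepancy on the adversarial examples
    x_adv = craft(x).detach()
    d = discrepancy(f1(x_adv), f2(x_adv))
    opt_f.zero_grad()
    (-d).backward()
    opt_f.step()

    # (2) generator: push both substitutes toward the target class while
    #     minimizing the (maximized) model discrepancy
    x_adv = craft(x)
    l1, l2 = f1(x_adv), f2(x_adv)
    loss = (F.cross_entropy(l1, target) + F.cross_entropy(l2, target)
            + discrepancy(l1, l2))
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
```

Alternating the two updates makes the adversarial examples agree across the most dissimilar substitutes the game can find, which is what the bound suggests drives transferability to the unseen target model.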