Model attribution is a critical component of deep neural networks (DNNs), as it provides interpretability for complex models. Recent studies have drawn attention to the security of attribution methods, which are vulnerable to attribution attacks that generate visually similar images with dramatically different attributions. Existing works have investigated empirically improving the robustness of DNNs against such attacks; however, none of them explicitly quantifies the actual deviation of attributions. In this work, for the first time, a constrained optimization problem is formulated to derive an upper bound on the largest dissimilarity of attributions after a sample is perturbed by any noise within a certain region while the classification result remains unchanged. Based on this formulation, different practical approaches are introduced to bound the attribution dissimilarity, measured by Euclidean distance and cosine similarity, under both $\ell_2$- and $\ell_\infty$-norm perturbation constraints. The bounds developed in our theoretical study are validated on various datasets against two different types of attacks (the PGD attack and the IFIA attribution attack). Over 10 million attacks in the experiments indicate that the proposed upper bounds effectively quantify the robustness of models in terms of the worst-case attribution dissimilarity.
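For concreteness, the constrained optimization described above can be sketched as follows; the notation here is assumed for illustration ($g(\cdot)$ denotes the attribution map, $f(\cdot)$ the classifier logits, $D$ the dissimilarity measure such as Euclidean distance or cosine dissimilarity, and $\epsilon$ the perturbation budget), and the paper's exact definitions may differ:
\begin{equation*}
\max_{\|\delta\|_p \le \epsilon} \; D\bigl(g(x),\, g(x+\delta)\bigr)
\quad \text{s.t.} \quad \arg\max_k f_k(x+\delta) = \arg\max_k f_k(x),
\qquad p \in \{2, \infty\}.
\end{equation*}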