Machine Learning (ML) models are susceptible to evasion attacks. Evasion accuracy is typically assessed using the aggregate evasion rate, and it is an open question whether this aggregate measure enables feature-level diagnosis of the effect of adversarial perturbations on evasive predictions. In this paper, we introduce a novel framework that harnesses explainable ML methods to guide high-fidelity assessment of ML evasion attacks. Our framework enables explanation-guided correlation analysis between pre-evasion perturbations and post-evasion explanations. Towards systematic assessment of ML evasion attacks, we propose and evaluate a novel suite of model-agnostic metrics for sample-level and dataset-level correlation analysis. Using malware and image classifiers, we conduct comprehensive evaluations across diverse model architectures and complementary feature representations. Our explanation-guided correlation analysis reveals correlation gaps between adversarial samples and the corresponding perturbations performed on them. Through a case study on explanation-guided evasion, we demonstrate the broader applicability of our methodology for assessing the robustness of ML models.
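To make the correlation analysis concrete, the following is a minimal sketch of one plausible sample-level metric: the overlap between the features an attack perturbed (pre-evasion) and the top-k features an explainer (e.g., SHAP or LIME) attributes the evasive prediction to (post-evasion), averaged over a dataset. The function names and the Jaccard-style overlap used here are illustrative assumptions, not the paper's exact metric definitions.

```python
import numpy as np

def sample_level_correlation(perturbed_features, explanation_weights, k=10):
    """Hypothetical sample-level metric: fraction of perturbed feature
    indices that also rank among the top-k features by absolute
    attribution score for the post-evasion prediction.

    perturbed_features: set of feature indices modified by the attack.
    explanation_weights: 1-D array of per-feature attribution scores.
    Returns a value in [0, 1]; 0 means no overlap, 1 full overlap.
    """
    if not perturbed_features:
        return 0.0
    # Indices of the k features with the largest |attribution|.
    top_k = set(np.argsort(np.abs(explanation_weights))[::-1][:k])
    return len(perturbed_features & top_k) / len(perturbed_features)

def dataset_level_correlation(samples, k=10):
    """Dataset-level metric: average the sample-level correlation over
    (perturbed_features, explanation_weights) pairs."""
    scores = [sample_level_correlation(p, w, k) for p, w in samples]
    return float(np.mean(scores)) if scores else 0.0

# Hypothetical usage: the attack flipped features 3 and 7, and the
# explainer scored 20 features for the resulting evasive prediction.
weights = np.random.randn(20)
print(sample_level_correlation({3, 7}, weights, k=5))
```

Under this reading, a low score signals a correlation gap: the attack's perturbations do not coincide with the features the explainer deems responsible for the evasive prediction.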