Counterfactual explanations and adversarial attacks have a related goal: flipping output labels with minimal perturbations, regardless of their characteristics. Yet, adversarial attacks cannot be used directly from a counterfactual explanation perspective, as such perturbations are perceived as noise rather than as actionable and understandable image modifications. Building on the robust learning literature, this paper proposes an elegant method to turn adversarial attacks into semantically meaningful perturbations without modifying the classifiers to be explained. The proposed approach hypothesizes that Denoising Diffusion Probabilistic Models are excellent regularizers for avoiding high-frequency and out-of-distribution perturbations when generating adversarial attacks. The paper's key idea is to build attacks through a diffusion model, which polishes them. This allows studying the target model regardless of its level of robustification. Extensive experimentation shows the advantages of our counterfactual explanation approach over the current state of the art on multiple testbeds.
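To make the key idea concrete, here is a minimal sketch of attacking "through" a diffusion model, assuming a pretrained `classifier` and a DDPM wrapper `ddpm_noise_then_denoise(x, t)` that adds noise up to timestep `t` and then denoises; these names and all hyper-parameters are hypothetical illustrations, not the paper's exact algorithm. The point illustrated is that the classifier is scored on the diffusion-polished image, so gradients favor semantic, in-distribution edits over high-frequency noise.

```python
import torch
import torch.nn.functional as F

def diffusion_guided_counterfactual(x, target_label, classifier,
                                    ddpm_noise_then_denoise,
                                    steps=50, lr=0.01, t_partial=250,
                                    l2_weight=0.1):
    # Optimize an additive perturbation toward the target label.
    delta = torch.zeros_like(x, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    target = torch.tensor([target_label], device=x.device)
    for _ in range(steps):
        # Polish the perturbed image: partially noise it, then denoise with
        # the DDPM so the attack is projected back toward the data manifold.
        x_polished = ddpm_noise_then_denoise(x + delta, t=t_partial)
        logits = classifier(x_polished)
        # Targeted label flip, plus a penalty keeping the perturbation minimal.
        loss = F.cross_entropy(logits, target) + l2_weight * delta.pow(2).mean()
        optimizer.zero_grad()
        loss.backward()  # gradients flow through the diffusion model
        optimizer.step()
    with torch.no_grad():
        return ddpm_noise_then_denoise(x + delta, t=t_partial)
```

Note that in this sketch the classifier itself is never modified or fine-tuned; the diffusion model alone regularizes the perturbation, consistent with the abstract's claim that the target model can be studied regardless of its robustification level.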