Counterfactuals, an emerging type of model explanation, have recently attracted considerable attention from both industry and academia. Unlike conventional feature-based explanations (e.g., attributions), counterfactuals are hypothetical samples that flip model decisions with minimal perturbations to the query. Given valid counterfactuals, humans can reason under ``what-if'' circumstances and thus better understand the model's decision boundaries. However, releasing counterfactuals can be detrimental, since they may unintentionally leak sensitive information to adversaries, raising risks to both model security and data privacy. To bridge this gap, in this paper we propose a novel framework that generates differentially private counterfactuals (DPC) without touching the deployed model or the explanation set, where noise is injected for protection while the explanatory role of the counterfactual is preserved. In particular, we train an autoencoder with the functional mechanism to construct noisy class prototypes, and then derive the DPC from the latent prototypes, relying on the post-processing immunity of differential privacy. Further evaluations demonstrate the effectiveness of the proposed framework, showing that DPC can successfully mitigate the risks of both extraction and inference attacks.
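To make the pipeline concrete, the sketch below illustrates the two stages described above: constructing noisy class prototypes in the autoencoder's latent space, and deriving a counterfactual from them by decoding, which is pure post-processing and therefore preserves differential privacy. This is a minimal illustration, not the paper's implementation: the functional-mechanism training (noise injected into the polynomial coefficients of the objective) is elided, and Laplace noise is instead added directly to the prototypes as a simpler stand-in; all names, dimensions, and the values of `epsilon` and `sensitivity` are assumptions for demonstration.

```python
# Hedged sketch of the DPC pipeline: noisy latent class prototypes,
# then a counterfactual decoded from them via post-processing.
import torch
import torch.nn as nn

torch.manual_seed(0)
input_dim, latent_dim, n_classes = 32, 8, 2
epsilon, sensitivity = 1.0, 1.0  # assumed privacy budget and L1 sensitivity

# Toy autoencoder; in the paper it would be trained under the
# functional mechanism (omitted here for brevity).
encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.Tanh())
decoder = nn.Linear(latent_dim, input_dim)

def noisy_prototypes(x, y):
    """Per-class latent means perturbed with Laplace noise (stand-in for
    the functional mechanism used in the paper)."""
    z = encoder(x).detach()
    protos = torch.stack([z[y == c].mean(dim=0) for c in range(n_classes)])
    noise = torch.distributions.Laplace(0.0, sensitivity / epsilon).sample(protos.shape)
    return protos + noise

def derive_dpc(query, target_class, protos, step=0.5):
    """Move the query's latent code toward the noisy target prototype and
    decode. Decoding only post-processes DP output, so privacy is preserved."""
    z_q = encoder(query).detach()
    z_cf = (1 - step) * z_q + step * protos[target_class]
    return decoder(z_cf).detach()

# Toy usage: random "training" data and a single query.
x = torch.randn(100, input_dim)
y = torch.randint(0, n_classes, (100,))
protos = noisy_prototypes(x, y)
dpc = derive_dpc(torch.randn(1, input_dim), target_class=1, protos=protos)
```

Note that only the prototype construction consumes privacy budget; the interpolation and decoding in `derive_dpc` are free under post-processing immunity, which is why the framework never needs to touch the deployed model when answering a query.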