Fairwashing refers to the risk that an unfair black-box model can be explained by a fairer model through the manipulation of post-hoc explanations. However, to realize this, the post-hoc explanation model must produce predictions that differ from those of the original black-box on some inputs, so the reduction in unfairness comes at the cost of a loss in fidelity. In this paper, our main objective is to characterize the risk of fairwashing attacks, in particular by investigating the fidelity-unfairness trade-off. First, through an in-depth empirical study on black-box models trained on several real-world datasets and for several statistical notions of fairness, we demonstrate that it is possible to build high-fidelity explanation models with low unfairness. For instance, we find that fairwashed explanation models can exhibit up to $99.20\%$ fidelity to the black-box models they explain while being $50\%$ less unfair. These results suggest that fidelity alone should not be used as a proxy for the quality of black-box explanations. Second, we show that fairwashed explanation models can generalize beyond the suing group (\emph{i.e.}, the data points whose predictions are being explained), a problem that will only worsen as more stable fairness methods are developed. Finally, we demonstrate that fairwashing attacks can transfer across black-box models, meaning that other black-box models can be fairwashed without explicitly using their predictions.
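For concreteness, a minimal formalization of the two quantities in this trade-off, assuming demographic parity as the fairness notion and binary predictions (the symbols $e$, $b$, $S$, and $a$ are illustrative, not the paper's notation): the fidelity of an explanation model $e$ to a black-box $b$ over a suing group $S$ can be written as
\[
\mathrm{fid}(e, b) = \frac{1}{|S|} \sum_{x \in S} \mathbb{1}\left[e(x) = b(x)\right],
\]
while its unfairness with respect to a binary sensitive attribute $a$ is the demographic parity gap
\[
\mathrm{unf}(e) = \bigl|\Pr[e(x) = 1 \mid a = 1] - \Pr[e(x) = 1 \mid a = 0]\bigr|.
\]
Under this formalization, whenever $\mathrm{unf}(e) < \mathrm{unf}(b)$ on the same data, the two models must disagree on some inputs, hence $\mathrm{fid}(e, b) < 1$.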