Fair classification aims to train classification models that achieve equality (of treatment or of prediction quality) among different sensitive groups. However, fair classification is at risk from poisoning attacks that deliberately insert malicious training samples to manipulate the performance of the trained classifiers. In this work, we study a poisoning scenario in which the attacker can insert a small fraction of samples into the training data, with arbitrary sensitive attributes as well as other predictive features. We demonstrate that fairly trained classifiers can be highly vulnerable to such poisoning attacks, suffering a much worse accuracy-fairness trade-off, even when we apply some of the most effective defenses (originally proposed to defend traditional classification tasks). As a countermeasure for fair classification tasks, we propose a general and theoretically guaranteed framework that adapts traditional defense methods to fair classification against poisoning attacks. Extensive experiments validate that the proposed defense framework achieves better robustness, in terms of both accuracy and fairness, than representative baseline methods.
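To make the threat model concrete, below is a minimal sketch (not the paper's attack algorithm) of how a poisoning adversary might append a small fraction of samples, with arbitrary features, labels, and sensitive attributes, to the clean training data before fair training. The function name `poison_training_set`, the fraction parameter `eps`, and the random injection rule are illustrative assumptions.

```python
import numpy as np

def poison_training_set(X, y, s, eps=0.1, rng=None):
    """Hypothetical illustration of the threat model: the attacker appends
    an eps-fraction of malicious samples with arbitrary features, labels,
    and sensitive attributes to the clean training data (X, y, s)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    n_poison = int(eps * n)
    # The attacker is free to choose any feature values, labels, and
    # sensitive-group memberships for the injected points; here they are
    # drawn at random purely for illustration.
    X_p = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n_poison, d))
    y_p = rng.integers(0, 2, size=n_poison)
    s_p = rng.integers(0, 2, size=n_poison)
    return (np.vstack([X, X_p]),
            np.concatenate([y, y_p]),
            np.concatenate([s, s_p]))
```

A fair classifier trained on the returned poisoned set, rather than on (X, y, s), is the object of study: the paper examines how much such injections degrade the accuracy-fairness trade-off and how traditional defenses can be adapted to mitigate them.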