With the growing awareness of fairness in machine learning, and the realization that data representation plays a central role in data processing tasks, there is an obvious interest in notions of fair data representations. The goal of such representations is to guarantee that a model trained on data under the representation (e.g., a classifier) respects some fairness constraints. Such representations are useful when they can be fixed once and reused for training models on various different tasks, and also when they serve as a data filter between the raw data (known to the representation designer) and potentially malicious agents who use the data under the representation to learn predictive models and make decisions. A long list of recent research papers strives to provide tools for achieving these goals. However, we prove that this is basically a futile effort. Roughly stated, we prove that no representation can guarantee the fairness of classifiers trained on top of it for different tasks; even the basic goal of achieving label-independent Demographic Parity fairness fails once the marginal data distribution shifts. More refined notions of fairness, like Odds Equality, cannot be guaranteed by a representation that does not take into account the task-specific labeling rule with respect to which such fairness will be evaluated (even if the marginal data distribution is known a priori). Furthermore, except for trivial cases, no representation can guarantee Odds Equality fairness for any two different tasks while allowing accurate label predictions for both. While some of our conclusions are intuitive, we formulate (and prove) crisp statements of such impossibilities, often contradicting impressions conveyed by many recent works on fair representations.
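The Demographic Parity claim can be illustrated with a toy numerical sketch (not taken from the paper; the representation, classifier, and numbers below are hypothetical): a fixed representation that yields a perfectly parity-respecting classifier under one marginal distribution gives no guarantee once the per-group marginals shift.

```python
# Toy sketch (illustrative only): a representation that satisfies Demographic
# Parity under the marginal the designer saw, but not after a marginal shift.
# Binary feature x in {0, 1}, binary group a in {0, 1}.

def dp_gap(classifier, rep, p_x1_given_a):
    """|P(classifier(rep(x)) = 1 | a = 0) - P(classifier(rep(x)) = 1 | a = 1)|."""
    rates = []
    for a in (0, 1):
        p1 = p_x1_given_a[a]  # P(x = 1 | a)
        rates.append(p1 * classifier(rep(1)) + (1 - p1) * classifier(rep(0)))
    return abs(rates[0] - rates[1])

rep = lambda x: x   # the fixed representation handed to downstream learners
clf = lambda z: z   # a downstream classifier trained on the representation

# Under the original marginal, both groups see x = 1 with probability 0.5,
# so the classifier satisfies Demographic Parity exactly (gap 0.0).
print(dp_gap(clf, rep, {0: 0.5, 1: 0.5}))

# After the marginal of group 0 shifts to P(x = 1 | a = 0) = 0.8, the very
# same representation and classifier exhibit a parity gap of about 0.3.
print(dp_gap(clf, rep, {0: 0.8, 1: 0.5}))
```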