Data-driven algorithms are being studied and deployed in diverse domains to support critical decisions, directly impacting people's well-being. As a result, a growing community of algorithmic fairness researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of the risks and opportunities of automated decision-making for different populations. Progress in algorithmic fairness hinges on data, which can be used appropriately only if adequately documented. Unfortunately, the algorithmic fairness community as a whole suffers from a collective data documentation debt, caused by a lack of information on specific resources (opacity) and the scattered nature of the information that is available (sparsity). In this work, we survey over two hundred datasets employed in algorithmic fairness research, producing standardized and searchable documentation for each of them, along with in-depth documentation for the three most popular fairness datasets, namely Adult, COMPAS and German Credit. These documentation efforts support multiple contributions. First, we summarize the merits and limitations of popular algorithmic fairness datasets, questioning their suitability as general-purpose fairness benchmarks. Second, we document hundreds of available alternatives, annotating their domain and supported fairness tasks, to assist dataset users in task-oriented and domain-oriented search. Finally, we analyze these resources from the perspective of five important data curation topics: anonymization, consent, inclusivity, labeling of sensitive attributes, and transparency. We discuss different approaches and levels of attention to these topics, making them tangible, and distill them into a set of best practices for the curation of novel datasets.