As decision-making increasingly relies on machine learning and (big) data, fairness in data-driven AI systems is attracting growing attention from both research and industry. A large variety of fairness-aware machine learning solutions have been proposed, introducing fairness-related interventions in the data, the learning algorithms, and/or the model outputs. A vital part of proposing new approaches, however, is evaluating them empirically on benchmark datasets that represent realistic and diverse settings. In this paper, we therefore survey real-world datasets used for fairness-aware machine learning, focusing on tabular data as its most common representation. We begin our analysis by identifying the relationships among the different attributes, particularly with respect to protected attributes and class attributes, using a Bayesian network. For a deeper understanding of bias and fairness in the datasets, we then investigate the most interesting of these relationships through exploratory analysis.
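A minimal sketch of the kind of analysis described above: score-based learning of a Bayesian network structure over a tabular dataset to surface dependencies between a protected attribute and the class attribute. It assumes the pgmpy library; the toy data and column names are illustrative placeholders, not the paper's actual datasets.

```python
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore

# Toy tabular data with a protected attribute ("sex") and a class attribute ("income").
# Purely illustrative; real analyses would use benchmark datasets such as Adult or COMPAS.
data = pd.DataFrame({
    "sex":       ["M", "F", "M", "F", "M", "F", "M", "F"] * 50,
    "education": ["HS", "BSc", "BSc", "HS", "MSc", "HS", "BSc", "MSc"] * 50,
    "income":    [">50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K"] * 50,
})

# Score-based structure learning: hill climbing guided by the BIC score.
search = HillClimbSearch(data)
dag = search.estimate(scoring_method=BicScore(data))

# Edges incident to the protected attribute point to relationships worth a
# closer exploratory look (e.g., class distributions conditioned on group).
print(list(dag.edges()))
```

Edges found between the protected attribute and the class attribute (directly or via intermediate attributes) are the candidates one would then examine with exploratory analysis, as the abstract describes.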