Data valuation is an ML field that studies the value of training instances for a given predictive task. Although data bias is one of the main sources of downstream model unfairness, previous work in data valuation does not consider how training instances may influence both the performance and the fairness of ML models. We therefore propose Fairness-Aware Data valuatiOn (FADO), a data valuation framework that can be used to incorporate fairness concerns into a range of ML-related tasks (e.g., data pre-processing, exploratory data analysis, active learning). We propose an entropy-based data valuation metric that is suited to our two-pronged goal of maximizing both performance and fairness and that is more computationally efficient than existing metrics. We then show how FADO can serve as the basis for unfairness mitigation pre-processing techniques. Our methods achieve promising results (up to a 40 p.p. improvement in fairness at a loss of less than 1 p.p. in performance compared to a baseline) and promote fairness in a data-centric way, where a deeper understanding of data quality takes center stage.
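To make the idea of an entropy-based, two-pronged valuation concrete, the following is a minimal illustrative sketch, not the FADO metric itself, whose exact definition is not given here. It assumes two hypothetical inputs per training instance: `p_label`, the predicted probability of the positive label, and `p_sensitive`, the predicted probability of membership in a protected group, each produced by some probabilistic classifier. The function names, the convex combination, and the weight `alpha` are all assumptions made for illustration.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a Bernoulli(p) distribution."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0 or p == 1.0:
        return 0.0  # limit of p * log2(p) as p -> 0 is 0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def instance_value(p_label: float, p_sensitive: float, alpha: float = 0.5) -> float:
    """Hypothetical value of one training instance.

    ASSUMPTION (not taken from the source): the value is a convex
    combination of the label entropy (a performance-oriented term) and
    the sensitive-attribute entropy (a fairness-oriented term), with
    alpha controlling the trade-off between the two.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * binary_entropy(p_label) + (1.0 - alpha) * binary_entropy(p_sensitive)


def rank_instances(p_labels, p_sensitives, alpha: float = 0.5):
    """Return instance indices sorted from highest to lowest value."""
    values = [
        instance_value(p_y, p_z, alpha)
        for p_y, p_z in zip(p_labels, p_sensitives)
    ]
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)
```

Under these assumptions, a pre-processing step could use the ranking from `rank_instances` to sample or reweight training instances, and setting `alpha = 1.0` would reduce the score to a purely performance-oriented uncertainty measure. Computing the score needs only one pass over predicted probabilities, which is one plausible way such a metric could be cheaper than valuation methods that require repeated retraining.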