Research in machine learning fairness has historically considered a single binary demographic attribute; however, the reality is, of course, far more complicated. In this work, we grapple with questions that arise at three stages of the machine learning pipeline when intersectionality is incorporated in the form of multiple demographic attributes: (1) which demographic attributes to include as dataset labels, (2) how to handle the progressively smaller size of subgroups during model training, and (3) how to move beyond existing evaluation metrics when benchmarking model fairness across a larger number of subgroups. For each question, we provide a thorough empirical evaluation on tabular datasets derived from the US Census, and we present constructive recommendations for the machine learning community. First, we advocate for supplementing domain knowledge with empirical validation when choosing which demographic attribute labels to train on, while always evaluating on the full set of demographic attributes. Second, we warn against using data imbalance techniques without considering their normative implications, and we suggest an alternative that leverages the structure in the data. Third, we introduce new evaluation metrics that are more appropriate for the intersectional setting. Overall, we provide substantive suggestions on three necessary (albeit not sufficient!) considerations when incorporating intersectionality into machine learning.
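The abstract gives no implementation details, but as a rough illustration of what evaluating model fairness across intersectional subgroups can look like on tabular data, here is a minimal, self-contained Python sketch. The column names, the synthetic stand-in data, and the worst-group and gap summaries are our own assumptions for illustration, not the paper's datasets or its proposed metrics.

```python
# Hypothetical sketch (not the paper's code): per-subgroup accuracy over the
# cross-product of multiple demographic attributes. All names and data here
# are illustrative stand-ins for a US-Census-style tabular dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

df = pd.DataFrame({
    "sex": rng.choice(["female", "male"], size=n),
    "race": rng.choice(["A", "B", "C", "D"], size=n),
    "age_band": rng.choice(["<30", "30-50", ">50"], size=n),
    "y_true": rng.integers(0, 2, size=n),
    "y_pred": rng.integers(0, 2, size=n),  # placeholder model output
})

attrs = ["sex", "race", "age_band"]

# Accuracy within every intersectional subgroup; also record subgroup sizes,
# which shrink as more attributes intersect (the second consideration above).
per_group = (
    df.assign(correct=(df["y_true"] == df["y_pred"]))
      .groupby(attrs)
      .agg(size=("correct", "size"), accuracy=("correct", "mean"))
)

print(per_group.sort_values("accuracy").head())
print("worst-group accuracy:", per_group["accuracy"].min())
print("max accuracy gap:",
      per_group["accuracy"].max() - per_group["accuracy"].min())
```

Note that grouping over the full cross-product makes the shrinking-subgroup problem concrete: with three attributes of 2, 4, and 3 levels, the 10,000 rows are already split across 24 cells, and each added attribute multiplies the number of cells while dividing their expected size.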