We investigate the fairness concerns of training a machine learning model using data with missing values. Even though there are a number of fairness intervention methods in the literature, most of them require a complete training set as input. In practice, data can have missing values, and the missing-data patterns can depend on group attributes (e.g., gender or race). Simply applying off-the-shelf fair learning algorithms to an imputed dataset may lead to an unfair model. In this paper, we first theoretically analyze different sources of discrimination risks when training with an imputed dataset. Then, we propose an integrated approach based on decision trees that does not require a separate process of imputation and learning. Instead, we train a tree with missing incorporated as attribute (MIA), which does not require explicit imputation, and we optimize a fairness-regularized objective function. Through several experiments on real-world datasets, we demonstrate that our approach outperforms existing fairness intervention methods applied to an imputed dataset.
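To make the two ingredients named above concrete, the following is a minimal sketch (not the authors' implementation) of an MIA-style split with a fairness-regularized criterion: missing values are routed as a group to whichever child gives the better score, so no imputation is needed, and the split score combines impurity with a demographic-parity-style penalty. The penalty weight `lam`, the disparity measure, and the majority-label child predictions are illustrative assumptions, not the exact objective used in the paper.

```python
import numpy as np


def gini(y):
    """Gini impurity of a binary label vector."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 2.0 * p * (1.0 - p)


def disparity(y_hat, s):
    """Absolute gap in positive prediction rates across groups in s."""
    rates = [np.mean(y_hat[s == g]) for g in np.unique(s)]
    return max(rates) - min(rates) if len(rates) > 1 else 0.0


def mia_split_score(x, y, s, threshold, lam=0.5):
    """Score a threshold split on one feature that may contain NaNs.

    MIA: missing entries are sent entirely left or entirely right, and the
    better routing is kept. The score is the weighted child impurity plus
    lam times the disparity of the induced hard (majority-label) predictions.
    """
    missing = np.isnan(x)
    le = np.zeros(len(x), dtype=bool)
    le[~missing] = x[~missing] <= threshold  # comparisons only on observed values
    best = None
    for route_missing_left in (True, False):
        left = le | (missing & route_missing_left)
        right = ~left
        if left.sum() == 0 or right.sum() == 0:
            continue
        impurity = (left.sum() * gini(y[left]) + right.sum() * gini(y[right])) / len(y)
        # Each child predicts its majority label; measure group disparity of these predictions.
        y_hat = np.where(left, np.mean(y[left]) >= 0.5, np.mean(y[right]) >= 0.5)
        score = impurity + lam * disparity(y_hat.astype(float), s)
        if best is None or score < best:
            best = score
    return best


# Toy data: one feature with 20% missing values, binary label, binary group attribute.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
x[rng.random(200) < 0.2] = np.nan
s = (rng.random(200) < 0.5).astype(int)
y = ((np.nan_to_num(x, nan=0.0) + 0.5 * s + rng.normal(scale=0.5, size=200)) > 0).astype(int)

candidates = np.nanquantile(x, [0.25, 0.5, 0.75])
print({float(t): mia_split_score(x, y, s, t) for t in candidates})
```

In this sketch, increasing `lam` trades split purity for lower disparity between groups, which mirrors the fairness-regularized objective described above; the full method applies such a criterion while growing the entire tree.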