Missing data are prevalent and present daunting challenges in real-world data analysis. While there is a growing body of literature on fairness in the analysis of fully observed data, there has been little theoretical work on fairness in the analysis of incomplete data. In practice, a popular approach for dealing with missing data is to train a prediction algorithm using only the complete cases, i.e., observations with all features fully observed. However, depending on the missing data mechanism, the distribution of the complete cases and the distribution of the complete data may be substantially different. When the goal is to develop a fair algorithm in the complete data domain, where there are no missing values, an algorithm that is fair in the complete-case domain may show disproportionate bias towards some marginalized groups in the complete data domain. To fill this significant gap, we study the problem of estimating fairness in the complete data domain for an arbitrary model evaluated using only complete cases. We provide upper and lower bounds on the fairness estimation error and conduct numerical experiments to assess our theoretical results. Our work provides the first known theoretical results on fairness guarantees in the analysis of incomplete data.
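To make the gap between the two domains concrete, the following is a minimal simulation sketch, not the paper's method or experiments. It assumes a demographic-parity-style fairness measure, a binary sensitive attribute, and a hypothetical missingness mechanism that depends on both the group and the model's prediction; the names `parity_gap` and `simulate`, and all rates, are illustrative choices.

```python
import numpy as np


def parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between group 1 and group 0."""
    y_pred = np.asarray(y_pred, dtype=float)
    group = np.asarray(group)
    return abs(y_pred[group == 1].mean() - y_pred[group == 0].mean())


def simulate(n=200_000, seed=0):
    """Draw predictions of a fixed model plus a complete-case indicator (all assumed)."""
    rng = np.random.default_rng(seed)
    group = rng.integers(0, 2, size=n)  # hypothetical binary sensitive attribute

    # Predictions of an arbitrary fixed model in the complete data domain:
    # positive rate 0.5 in group 0 and 0.3 in group 1 (assumed values).
    positive_rate = np.where(group == 1, 0.3, 0.5)
    y_pred = (rng.random(n) < positive_rate).astype(int)

    # Assumed missingness mechanism: group-1 rows with a negative prediction
    # lose a feature with probability 4/7; all other rows are always complete.
    p_missing = np.where((group == 1) & (y_pred == 0), 4 / 7, 0.0)
    observed = rng.random(n) >= p_missing  # True marks a complete case
    return y_pred, group, observed


y_pred, group, observed = simulate()
gap_complete_data = parity_gap(y_pred, group)
gap_complete_cases = parity_gap(y_pred[observed], group[observed])
print(f"parity gap, complete data:  {gap_complete_data:.3f}")
print(f"parity gap, complete cases: {gap_complete_cases:.3f}")
```

Under these assumed rates, the retained group-1 rows have a positive rate of 0.3 / (0.3 + 0.7 × 3/7) = 0.5, so the model looks nearly fair on the complete cases even though its parity gap in the complete data domain is about 0.2.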