The underlying assumption of many machine learning algorithms is that the training data and test data are drawn from the same distribution. However, this assumption is often violated in the real world due to sample selection bias between the training and test data. Previous research has focused on reweighting biased training data to match the test data and then building classification models on the reweighted training data. However, how to achieve fairness in the resulting classification models remains under-explored. In this paper, we propose a framework for robust and fair learning under sample selection bias. Our framework adopts the reweighting estimation approach for bias correction and the minimax robust estimation approach for achieving robustness in prediction accuracy. Moreover, during the minimax optimization, fairness is enforced under the worst case, which guarantees the model's fairness on test data. We further develop two algorithms to handle sample selection bias when test data is available and when it is unavailable. We conduct experiments on two real-world datasets, and the experimental results demonstrate the framework's effectiveness in terms of both utility and fairness metrics.
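To make the reweighting step concrete, the following is a minimal sketch of the standard density-ratio reweighting approach for sample selection bias: a domain classifier is trained to separate training from test samples, and each training point is weighted by the estimated ratio p_test(x)/p_train(x) before fitting the downstream model. This is a generic illustration of reweighting estimation, not the paper's specific algorithm; the data, classifier choice, and weighting formula are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Biased training data: features skewed toward negative values
# (hypothetical example of sample selection bias).
X_train = rng.normal(loc=-1.0, scale=1.0, size=(500, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > -2).astype(int)

# Test data drawn from a shifted distribution (covariate shift).
X_test = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Step 1: estimate density ratios w(x) = p_test(x) / p_train(x)
# via a domain classifier that distinguishes train (0) from test (1).
X_dom = np.vstack([X_train, X_test])
y_dom = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
dom_clf = LogisticRegression().fit(X_dom, y_dom)
p_test = dom_clf.predict_proba(X_train)[:, 1]
weights = p_test / (1.0 - p_test)  # odds ratio approximates the density ratio

# Step 2: fit the downstream classifier on the reweighted training data,
# so that it better matches the test distribution.
clf = LogisticRegression().fit(X_train, y_train, sample_weight=weights)
```

Training points that look more like test samples (here, those with larger feature values) receive larger weights, which is the bias-correction effect the reweighting estimation approach relies on.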