A recent explosion of research focuses on developing methods and tools for building fair predictive models. However, most of this work relies on the assumption that the training and testing data are representative of the target population on which the model will be deployed. In practice, real-world training data often suffer from selection bias and are not representative of the target population for many reasons, including the cost and feasibility of collecting and labeling data, historical discrimination, and individual biases. In this paper, we introduce a new framework for certifying and ensuring the fairness of predictive models trained on biased data. We take inspiration from query answering over incomplete and inconsistent databases to present and formalize the problem of consistent range approximation (CRA) of answers to queries about aggregate information for the target population. We aim to leverage background knowledge about the data collection process, biased data, and limited or no auxiliary data sources to compute a range of answers for aggregate queries over the target population that are consistent with available information. We then develop methods that use CRA of such aggregate queries to build predictive models that are certifiably fair on the target population even when no external information about that population is available during training. We evaluate our methods on real data and demonstrate improvements over the state of the art. Significantly, we show that enforcing fairness using our methods can lead to predictive models that are not only fair but also more accurate on the target population.
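To make the idea of a range of consistent answers concrete, the following is a toy sketch and not the algorithm developed in the paper. It assumes a deliberately simple form of background knowledge: the size of each group in the target population is known, and the rows missing from the biased sample may carry any binary label. The function names `rate_range` and `parity_gap_range` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def rate_range(observed_pos, observed_n, population_n):
    """Range of the positive rate over a population of known size.

    Assumption (illustrative only): `observed_n` rows were collected,
    `observed_pos` of them are positive, and the remaining
    `population_n - observed_n` rows have unknown binary labels.
    """
    if population_n <= 0 or not (0 <= observed_pos <= observed_n <= population_n):
        raise ValueError("need 0 <= observed_pos <= observed_n <= population_n, population_n > 0")
    missing = population_n - observed_n
    # Lower bound: every missing row is negative.
    lo = Fraction(observed_pos, population_n)
    # Upper bound: every missing row is positive.
    hi = Fraction(observed_pos + missing, population_n)
    return lo, hi


def parity_gap_range(range_a, range_b):
    """Range of rate_a - rate_b given an independent range for each group."""
    lo_a, hi_a = range_a
    lo_b, hi_b = range_b
    # Interval subtraction: smallest gap pairs lo_a with hi_b, largest pairs hi_a with lo_b.
    return lo_a - hi_b, hi_a - lo_b


# Hypothetical numbers: group A has 50 of 100 rows observed (30 positive),
# group B has 80 of 100 rows observed (20 positive).
range_a = rate_range(30, 50, 100)
range_b = rate_range(20, 80, 100)
gap_lo, gap_hi = parity_gap_range(range_a, range_b)
print(f"group A rate in [{range_a[0]}, {range_a[1]}]")
print(f"group B rate in [{range_b[0]}, {range_b[1]}]")
print(f"demographic parity gap in [{gap_lo}, {gap_hi}]")
```

Under these assumptions, a fairness claim about the target population could be certified only if it holds for every value in the computed range. Stronger background knowledge about the data collection process would narrow the range.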