Algorithmic risk assessments are increasingly used to make and inform decisions in a wide variety of high-stakes settings. In practice, there is often a multitude of predictive models that deliver similar overall performance, an empirical phenomenon commonly known as the "Rashomon Effect." While many competing models may perform similarly overall, they may behave differently across subgroups, and therefore have drastically different predictive fairness properties. In this paper, we develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance, or "the set of good models." We provide tractable algorithms to compute the range of attainable group-level predictive disparities and the disparity-minimizing model over the set of good models. We extend our framework to address the empirically relevant challenge of selectively labelled data in the setting where the selection decision and outcome are unconfounded given the observed data features. We illustrate our methods in two empirical applications. In a real-world credit-scoring task, we build a model with lower predictive disparities than the benchmark model, and demonstrate the benefits of properly accounting for the selective labels problem. In a recidivism risk prediction task, we audit an existing risk score, and find that it generates larger predictive disparities than any model in the set of good models.
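To make the core idea concrete, the following is a toy brute-force sketch (not the paper's algorithm) of auditing the "set of good models": among candidate classifiers within a small accuracy tolerance of the best one, compute the range of attainable group-level disparities. All data and names here are hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: feature x, binary group g, binary outcome y.
n = 2000
g = rng.integers(0, 2, n)                        # group membership
x = rng.normal(loc=g * 0.5, scale=1.0, size=n)   # feature correlated with group
y = (x + rng.normal(scale=1.0, size=n) > 0.5).astype(int)

# Candidate model class: threshold classifiers that predict 1 when x > t.
thresholds = np.linspace(-1.0, 2.0, 61)

def accuracy(t):
    return np.mean((x > t).astype(int) == y)

def disparity(t):
    """Absolute gap in predicted-positive rates between the two groups."""
    pred = (x > t).astype(int)
    return abs(pred[g == 0].mean() - pred[g == 1].mean())

accs = np.array([accuracy(t) for t in thresholds])
eps = 0.01                                       # performance tolerance
good = thresholds[accs >= accs.max() - eps]      # the "set of good models"

disps = np.array([disparity(t) for t in good])
print(f"{len(good)} good models; disparity range "
      f"[{disps.min():.3f}, {disps.max():.3f}]")
```

A nontrivial disparity range over equally accurate models is exactly the Rashomon Effect the abstract describes: model selection within the good set can reduce disparities at negligible cost to overall performance.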