When a model's performance differs across socially or culturally relevant groups, such as race, gender, or the intersections of many such groups, it is often called "biased." While much of the work in algorithmic fairness over the last several years has focused on developing various definitions of model fairness (the absence of group-wise model performance disparities) and eliminating such "bias," much less work has gone into rigorously measuring it. In practice, it is important to have high-quality, human-digestible measures of model performance disparities, with associated uncertainty quantification, that can serve as inputs into multi-faceted decision-making processes. In this paper, we show both mathematically and through simulation that many of the metrics used to measure group-wise model performance disparities are themselves statistically biased estimators of the underlying quantities they purport to represent. We argue that this can lead to misleading conclusions about the relative group-wise model performance disparities along different dimensions, especially in cases where some sensitive variables consist of categories with few members. We propose the "double-corrected" variance estimator, which provides unbiased estimates and uncertainty quantification of the variance of model performance across groups. It is conceptually simple and easily implementable without a statistical software package or numerical optimization. We demonstrate the utility of this approach through simulation and show on a real dataset that, while statistically biased estimators of group-wise model performance disparities indicate statistically significant differences, the estimated between-group disparities are no longer statistically significant once the statistical bias in the estimator is accounted for.
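To illustrate why a plug-in measure of between-group disparity can be statistically biased, the following is a minimal simulation sketch. It is not the paper's exact estimator: it assumes the performance metric is per-group accuracy, that groups are sampled independently, and that the correction consists of the usual `ddof=1` sample variance minus the average estimated sampling variance of the group accuracies. The function names, group sizes, and accuracy value are all hypothetical choices made for this example.

```python
import numpy as np


def naive_between_group_variance(correct, n):
    """Sample variance of the estimated per-group accuracies (plug-in)."""
    p_hat = correct / n
    return np.var(p_hat, ddof=1)


def corrected_between_group_variance(correct, n):
    """Plug-in variance minus the mean estimated sampling variance.

    Assumption for this sketch: each group's accuracy is a binomial
    proportion, so p_hat * (1 - p_hat) / (n - 1) is an unbiased estimate
    of its sampling variance p * (1 - p) / n. Requires n > 1 per group.
    """
    p_hat = correct / n
    sampling_var = p_hat * (1 - p_hat) / (n - 1)
    return np.var(p_hat, ddof=1) - sampling_var.mean()


# Hypothetical setup: 10 small groups with IDENTICAL true accuracy,
# so the true between-group variance is exactly zero.
rng = np.random.default_rng(0)
n = np.full(10, 20)
true_acc = np.full(10, 0.8)

naive, corrected = [], []
for _ in range(2000):
    correct = rng.binomial(n, true_acc)  # correct predictions per group
    naive.append(naive_between_group_variance(correct, n))
    corrected.append(corrected_between_group_variance(correct, n))

naive_mean = float(np.mean(naive))
corrected_mean = float(np.mean(corrected))
print(f"mean naive estimate:     {naive_mean:.5f}")
print(f"mean corrected estimate: {corrected_mean:.5f}")
```

In this setup the naive estimate averages well above zero even though no true disparity exists, while the corrected estimate averages close to zero. Note that an individual corrected estimate can be negative; that is the price of removing the bias, and it is why the estimate should be read together with its uncertainty.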