When a model's performance differs across socially or culturally relevant groups, such as race, gender, or the intersections of many such groups, it is often called "biased." While much of the work in algorithmic fairness over the last several years has focused on developing various definitions of model fairness (the absence of group-wise model performance disparities) and eliminating such "bias," much less work has gone into rigorously measuring it. In practice, it is important to have high-quality, human-digestible measures of model performance disparities, with associated uncertainty quantification, that can serve as inputs into multi-faceted decision-making processes. In this paper, we show both mathematically and through simulation that many of the metrics used to measure group-wise model performance disparities are themselves statistically biased estimators of the underlying quantities they purport to represent. We argue that this can cause misleading conclusions about the relative group-wise model performance disparities along different dimensions, especially in cases where some sensitive variables consist of categories with few members. We propose the "double-corrected" variance estimator, which provides unbiased estimates and uncertainty quantification of the variance of model performance across groups. It is conceptually simple and easily implementable without a statistical software package or numerical optimization. We demonstrate the utility of this approach through simulation. We also show on a real dataset that statistically biased estimators of group-wise model performance disparities indicate statistically significant between-group disparities, whereas once the statistical bias in the estimator is accounted for, the estimated group-wise disparities in model performance are no longer statistically significant.
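To make the statistical-bias claim concrete, the following minimal sketch simulates groups that all share the same true accuracy, so the true between-group variance is zero. It is not the paper's "double-corrected" estimator, whose exact form is not given in this abstract. It assumes a simple moment-based correction that subtracts the average estimated sampling variance from the naive sample variance of group-wise accuracies. The function names, group sizes, and accuracy value are hypothetical choices for illustration only.

```python
import random
from statistics import mean, variance


def naive_variance(correct, sizes):
    """Sample variance of the observed group-wise accuracies."""
    rates = [c / n for c, n in zip(correct, sizes)]
    return variance(rates)


def corrected_variance(correct, sizes):
    """Naive variance minus the average estimated sampling variance.

    Assumption for this sketch: each group's accuracy is a binomial
    proportion, for which p_hat * (1 - p_hat) / (n - 1) is an unbiased
    estimate of its sampling variance. The result can be negative.
    """
    rates = [c / n for c, n in zip(correct, sizes)]
    noise = [p * (1 - p) / (n - 1) for p, n in zip(rates, sizes)]
    return variance(rates) - mean(noise)


def simulate(true_acc, sizes, reps=5000, seed=0):
    """Average both estimators over repeated simulated evaluation sets."""
    rng = random.Random(seed)
    naive, corrected = [], []
    for _ in range(reps):
        # Number of correct predictions in each group for this replicate.
        correct = [
            sum(rng.random() < p for _ in range(n))
            for p, n in zip(true_acc, sizes)
        ]
        naive.append(naive_variance(correct, sizes))
        corrected.append(corrected_variance(correct, sizes))
    return mean(naive), mean(corrected)


if __name__ == "__main__":
    # Hypothetical setup: identical true accuracy in every group,
    # so the true variance across groups is exactly zero.
    true_acc = [0.8] * 6
    sizes = [10, 10, 20, 20, 100, 100]
    naive_avg, corrected_avg = simulate(true_acc, sizes)
    print(f"true variance:       {0.0:.5f}")
    print(f"naive estimate:      {naive_avg:.5f}")
    print(f"corrected estimate:  {corrected_avg:.5f}")
```

Under these assumptions the naive estimate averages well above zero even though no true disparity exists, and the inflation is driven mostly by the smallest groups. The corrected estimate averages close to zero, at the cost of occasionally returning a negative value in a single sample. The sketch does not attempt the uncertainty quantification that the paper's estimator provides.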