Multicalibration is a desirable fairness criterion that constrains calibration error among flexibly defined groups in the data while maintaining overall calibration. However, when outcome probabilities are correlated with group membership, multicalibrated models can exhibit higher percent calibration error among groups with lower base rates than among groups with higher base rates. As a result, a decision-maker may still learn to trust or distrust model predictions for specific groups. To alleviate this, we propose proportional multicalibration (PMC), a criterion that constrains the percent calibration error among groups and within prediction bins. We prove that satisfying proportional multicalibration bounds a model's multicalibration as well as its differential calibration, a stronger fairness criterion inspired by the fairness notion of sufficiency. We provide an efficient algorithm for post-processing risk prediction models to satisfy proportional multicalibration and evaluate it empirically. We conduct simulation studies and investigate a real-world application of PMC post-processing to the prediction of emergency department patient admissions. We observe that proportional multicalibration is a promising criterion for simultaneously controlling multiple measures of a model's calibration fairness over intersectional groups, with virtually no cost in terms of classification performance.
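To make the quantity being constrained concrete, the following is a minimal sketch (not the paper's algorithm) of a worst-case percent calibration error over (group, prediction-bin) cells: within each cell, the absolute gap between the mean outcome and the mean prediction is divided by the mean outcome, so the error is measured relative to the cell's base rate. The function name, fixed-width binning, and zero-base-rate handling are illustrative assumptions.

```python
from collections import defaultdict

def proportional_calibration_error(y_true, y_prob, groups, n_bins=10):
    # Hedged sketch of a percent calibration error: for each
    # (group, prediction bin) cell, compute
    #   |mean outcome - mean prediction| / mean outcome
    # and return the worst cell. Dividing by the cell's base rate is
    # what makes the criterion "proportional" rather than absolute.
    cells = defaultdict(list)
    for y, p, g in zip(y_true, y_prob, groups):
        b = min(int(p * n_bins), n_bins - 1)  # fixed-width bins on [0, 1]
        cells[(g, b)].append((y, p))
    worst = 0.0
    for members in cells.values():
        obs = sum(y for y, _ in members) / len(members)   # cell base rate
        if obs == 0:
            continue  # percent error is undefined at zero base rate
        pred = sum(p for _, p in members) / len(members)  # mean prediction
        worst = max(worst, abs(obs - pred) / obs)
    return worst
```

A perfectly calibrated cell contributes zero; a cell whose mean prediction overshoots a base rate of 0.5 by 0.3 contributes 0.6, illustrating how the same absolute miscalibration weighs more heavily on low-base-rate groups.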