Multicalibration is a desirable fairness criterion that constrains calibration error among flexibly-defined groups in the data while maintaining overall calibration. However, when outcome probabilities are correlated with group membership, multicalibrated models can exhibit higher percent calibration error for groups with lower base rates than for groups with higher base rates. As a result, it remains possible for a decision-maker to learn to trust or distrust model predictions for specific groups. To alleviate this, we propose \emph{proportional multicalibration}, a criterion that constrains the percent calibration error among groups and within prediction bins. We prove that satisfying proportional multicalibration bounds a model's multicalibration as well as its \emph{differential calibration}, a stronger fairness criterion inspired by the fairness notion of sufficiency. We provide an efficient algorithm for post-processing risk prediction models for proportional multicalibration and evaluate it empirically. We conduct simulation studies and investigate a real-world application of PMC post-processing to the prediction of emergency department patient admissions. We observe that proportional multicalibration is a promising criterion for simultaneously controlling measures of calibration fairness of a model over intersectional groups with virtually no cost in terms of classification performance.