Probabilistic predictions can be evaluated through comparisons with observed label frequencies, that is, through the lens of calibration. Recent work on algorithmic fairness has begun to examine a growing variety of calibration-based objectives under the name of multi-calibration, but the scope of these objectives has remained fairly restricted. In this paper, we explore and analyse forms of evaluation through calibration by making explicit the choices involved in designing calibration scores. We organise these into three grouping choices and a choice concerning the agglomeration of group errors. This provides a framework for comparing previously proposed calibration scores and helps to formulate novel ones with desirable mathematical properties. In particular, we explore the possibility of grouping datapoints based on their input features rather than on predictions, and formally demonstrate advantages of such approaches. We also characterise the space of suitable agglomeration functions for group errors, generalising previously proposed calibration scores. Complementary to such population-level scores, we explore calibration scores at the individual level and analyse their relationship to choices of grouping. We draw on these insights to introduce and axiomatise fairness deviation measures for population-level scores. We demonstrate that, with appropriate choices of grouping, these novel global fairness scores can provide notions of (sub-)group or individual fairness.
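The two design choices the abstract highlights, how datapoints are grouped (by predictions versus by input features) and how group errors are agglomerated, can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's actual scores: it assumes a simple per-group error of the form |mean prediction − mean label|, and two agglomeration choices (a prevalence-weighted mean and a worst-group maximum); the function name and interface are invented for exposition.

```python
import numpy as np

def group_calibration_error(preds, labels, groups, agg="mean"):
    """Illustrative group-wise calibration score (hypothetical).

    Computes |mean prediction - mean label| within each group, then
    agglomerates the group errors either by prevalence-weighted mean
    or by taking the worst-group (max) error.
    """
    errs, weights = [], []
    for g in np.unique(groups):
        mask = groups == g
        errs.append(abs(preds[mask].mean() - labels[mask].mean()))
        weights.append(mask.mean())  # fraction of datapoints in group g
    errs, weights = np.array(errs), np.array(weights)
    if agg == "mean":
        return float(np.sum(weights * errs))  # weighted average of group errors
    if agg == "max":
        return float(errs.max())  # worst-group error
    raise ValueError(f"unknown agglomeration: {agg!r}")

# Synthetic, well-calibrated predictions: labels ~ Bernoulli(preds).
rng = np.random.default_rng(0)
preds = rng.uniform(size=1000)
labels = (rng.uniform(size=1000) < preds).astype(float)

# Choice 1: group by binned predictions (as in classical binned calibration).
pred_groups = np.digitize(preds, bins=np.linspace(0.0, 1.0, 11))
ece_like = group_calibration_error(preds, labels, pred_groups, agg="mean")

# Choice 2: group by an input feature instead (here a binary attribute),
# with worst-group agglomeration.
feature = rng.integers(0, 2, size=1000)
feat_err = group_calibration_error(preds, labels, feature, agg="max")
```

Grouping by an input feature checks calibration on subpopulations defined independently of the model's own outputs, which is what allows such a score to carry a fairness interpretation; the agglomeration choice then determines whether the score reflects average-case or worst-group miscalibration.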