We study the fundamental question of how to define and measure the distance from calibration for probabilistic predictors. While the notion of perfect calibration is well understood, there is no consensus on how to quantify the distance from perfect calibration. Numerous calibration measures have been proposed in the literature, but it is unclear how they compare to one another, and many popular measures such as the Expected Calibration Error (ECE) fail to satisfy basic properties such as continuity. We present a rigorous framework for analyzing calibration measures, inspired by the literature on property testing. We propose a ground-truth notion of distance from calibration: the $\ell_1$ distance to the nearest perfectly calibrated predictor. We define a consistent calibration measure as one that is polynomially related to this distance. Applying our framework, we identify three calibration measures that are consistent and can be estimated efficiently: smooth calibration, interval calibration, and Laplace kernel calibration. The former two give quadratic approximations to the ground-truth distance, which we show is information-theoretically optimal in a natural model for measuring calibration that we term the prediction-only access model. Our work thus establishes fundamental lower and upper bounds on measuring the distance to calibration, and also provides theoretical justification for preferring certain metrics (like Laplace kernel calibration) in practice.
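A minimal formal sketch of the two central definitions may help; the notation below ($f$ for the predictor, $g$ for a perfectly calibrated competitor, $\mathrm{dCE}$ for the ground-truth distance, and the constants $c, C, p, q$) is chosen here for illustration and is not quoted from the paper. The ground-truth distance is the $\ell_1$ distance to the nearest perfectly calibrated predictor,
\[
\mathrm{dCE}(f) \;=\; \inf_{g\ \text{perfectly calibrated}} \mathbb{E}\bigl[\,\lvert f(x) - g(x) \rvert\,\bigr],
\]
and a calibration measure $\mu$ is consistent if it is polynomially related to this distance, i.e., there exist constants $c, C > 0$ and exponents $p, q \geq 1$ such that
\[
c \cdot \mathrm{dCE}(f)^{p} \;\le\; \mu(f) \;\le\; C \cdot \mathrm{dCE}(f)^{q}
\quad \text{for every predictor } f.
\]
On this schematic reading, a quadratic approximation corresponds to a quadratic gap between the two exponents (e.g., $p = 2$, $q = 1$).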
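Since the abstract emphasizes that these measures can be estimated efficiently from samples, the following is a minimal sketch of a plug-in estimator for Laplace kernel calibration. It assumes the standard kernel calibration error form $\mathbb{E}[(y - v)(y' - v')\,k(v, v')]$ over independent sample pairs, with the Laplace kernel $k(v, v') = e^{-|v - v'|}$; the function name `laplace_kernel_ce` and this exact form are assumptions for illustration, not code or notation from the paper.

```python
import numpy as np

def laplace_kernel_ce(preds, labels):
    """U-statistic estimate of the (squared) Laplace kernel calibration
    error from n (prediction, binary outcome) samples.

    Assumed form (an illustrative convention, not the paper's):
        KCE^2 = E[(y - v)(y' - v') * k(v, v')],  k(v, v') = exp(-|v - v'|),
    where (v, y) and (v', y') are independent draws.
    """
    v = np.asarray(preds, dtype=float)    # predictions in [0, 1]
    y = np.asarray(labels, dtype=float)   # binary outcomes in {0, 1}
    r = y - v                             # calibration residuals
    # Kernel matrix K[i, j] = exp(-|v_i - v_j|); note K[i, i] = 1.
    K = np.exp(-np.abs(v[:, None] - v[None, :]))
    n = len(v)
    # Sum r_i * r_j * K[i, j] over distinct pairs i != j, then normalize:
    # an unbiased estimate that can be slightly negative under calibration.
    total = r @ K @ r - np.sum(r * r)
    return total / (n * (n - 1))

# Usage: a perfectly calibrated predictor should score near 0,
# while a systematically biased one should score clearly above 0.
rng = np.random.default_rng(0)
v = rng.uniform(size=10_000)
y_cal = (rng.uniform(size=10_000) < v).astype(float)        # y ~ Bernoulli(v)
y_off = (rng.uniform(size=10_000) < v ** 2).astype(float)   # miscalibrated
print(laplace_kernel_ce(v, y_cal), laplace_kernel_ce(v, y_off))
```

The estimator runs in $O(n^2)$ time via the dense kernel matrix, which suffices as a sketch; the efficiency claims in the abstract concern sample and time complexity of estimating such measures, not this particular implementation.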