Average calibration of the prediction uncertainties of machine learning regression tasks can be tested in two ways: one is to estimate the calibration error (CE) as the difference between the mean squared error (MSE) and the mean variance (MV), or mean squared uncertainty; the alternative is to compare the mean squared z-scores (ZMS), or scaled errors, to 1. The problem is that the two approaches can lead to different conclusions, as illustrated in this study for an ensemble of datasets from the recent machine learning uncertainty quantification (ML-UQ) literature. It is shown that the estimation of MV, MSE and their confidence intervals can become unreliable for heavy-tailed uncertainty and error distributions, which appears to be a common issue for ML-UQ datasets. By contrast, the ZMS statistic is less sensitive and offers the most reliable approach in this context. Unfortunately, the same problem also affects conditional calibration statistics, such as the popular ENCE, and very likely post-hoc calibration methods based on similar statistics. As little can be done to mitigate this issue, short of a change of paradigm to interval- or distribution-based UQ metrics, robust tailedness metrics are proposed to detect potentially problematic datasets.
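The two average-calibration tests named above can be sketched as follows; this is a minimal illustration on synthetic, well-calibrated data (all variable names and the toy data-generating process are assumptions for the example, not the study's datasets):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy calibrated case: each error e_i is drawn with the predicted
# standard uncertainty u_i, so both tests should pass on average.
n = 10_000
u = rng.uniform(0.5, 2.0, n)   # predicted uncertainties (standard deviations)
e = rng.normal(0.0, u)         # prediction errors consistent with u

# Test 1: calibration error as the difference between the mean squared
# error (MSE) and the mean variance (MV, mean squared uncertainty).
mse = np.mean(e**2)
mv = np.mean(u**2)
ce = mse - mv                  # close to 0 for average calibration

# Test 2: mean squared z-scores (ZMS), compared to 1.
zms = np.mean((e / u)**2)      # close to 1 for average calibration
```

For heavy-tailed uncertainty or error distributions, `mse` and `mv` (and hence `ce`) become dominated by a few extreme values, whereas the scaled errors `e / u` entering `zms` are less affected, which is the contrast the study examines.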