Model diagnostics and forecast evaluation are two sides of the same coin. A common principle is that fitted or predicted distributions ought to be calibrated or reliable, ideally in the sense of auto-calibration, where the outcome is a random draw from the posited distribution. For binary responses, this is the universal concept of reliability. For real-valued outcomes, a general theory of calibration has been elusive, despite a recent surge of interest in distributional regression and machine learning. We develop a framework rooted in probability theory, which gives rise to hierarchies of calibration, and applies to both predictive distributions and stand-alone point forecasts. In a nutshell, a prediction, whether distributional or single-valued, is conditionally T-calibrated if it can be taken at face value in terms of the functional T. Whenever T is defined via an identification function (as in the cases of threshold (non-)exceedance probabilities, quantiles, expectiles, and moments), auto-calibration implies T-calibration. We introduce population versions of T-reliability diagrams and revisit a score decomposition into measures of miscalibration (MCB), discrimination (DSC), and uncertainty (UNC). In empirical settings, stable and efficient estimators of T-reliability diagrams and score components arise via nonparametric isotonic regression and the pool-adjacent-violators algorithm. For in-sample model diagnostics, we propose a universal coefficient of determination, $$\text{R}^\ast = \frac{\text{DSC}-\text{MCB}}{\text{UNC}},$$ that nests and reinterprets the classical $\text{R}^2$ in least squares (mean) regression and its natural analogue $\text{R}^1$ in quantile regression, yet applies to T-regression in general, with MCB $\geq 0$, DSC $\geq 0$, and $\text{R}^\ast \in [0,1]$ under modest conditions.
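To make the score decomposition and $\text{R}^\ast$ concrete, the following is a minimal sketch for the special case where T is the mean and the score is squared error. It assumes the usual construction in which the recalibrated forecast is the isotonic regression of outcomes on forecasts (computed by pool-adjacent-violators), the reference forecast is the marginal mean of the outcomes, and MCB, DSC, and UNC are differences of the corresponding mean scores. The function names `pav_recalibrate`, `mean_score`, and `score_decomposition`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
def pav_recalibrate(x, y):
    """Nondecreasing least-squares fit of y on x (pool-adjacent-violators)."""
    order = sorted(range(len(x)), key=lambda i: x[i])

    # Step 1: pool observations that share the same forecast value,
    # since a function of x must assign them a common fitted value.
    groups = []  # each entry: [sum of y, count, indices]
    for i in order:
        if groups and x[i] == x[groups[-1][2][-1]]:
            groups[-1][0] += y[i]
            groups[-1][1] += 1
            groups[-1][2].append(i)
        else:
            groups.append([y[i], 1, [i]])

    # Step 2: merge adjacent blocks while their means decrease.
    blocks = []
    for g in groups:
        blocks.append(g)
        # Compare block means via cross-multiplication (counts are positive).
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, n, idx = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2].extend(idx)

    fitted = [0.0] * len(x)
    for s, n, idx in blocks:
        for i in idx:
            fitted[i] = s / n
    return fitted


def mean_score(pred, y):
    """Mean squared error, the score assumed here for the mean functional."""
    return sum((p - o) ** 2 for p, o in zip(pred, y)) / len(y)


def score_decomposition(x, y):
    """Assumed decomposition: score = MCB - DSC + UNC (mean functional only)."""
    y_bar = sum(y) / len(y)
    s_x = mean_score(x, y)                       # original forecast
    s_rc = mean_score(pav_recalibrate(x, y), y)  # recalibrated forecast
    s_mg = mean_score([y_bar] * len(y), y)       # marginal (constant) forecast
    mcb, dsc, unc = s_x - s_rc, s_mg - s_rc, s_mg
    return {"score": s_x, "MCB": mcb, "DSC": dsc, "UNC": unc,
            "R_star": (dsc - mcb) / unc}


# Toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.5, 1.0, 3.5, 3.0, 6.0]
result = score_decomposition(x, y)
print(result)
```

Under these assumptions MCB and DSC are nonnegative by construction, because both the identity map and any constant map are nondecreasing functions of the forecast and so cannot beat the isotonic fit in-sample. The ratio also simplifies to one minus the forecast's mean score divided by UNC, which is how the classical $\text{R}^2$ is recovered in this special case.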