Sequentially valid tests for forecast calibration

Forecasting and forecast evaluation are inherently sequential tasks. Predictions are often issued on a regular basis, such as every hour, day, or month, and their quality is monitored continuously. However, the classical statistical tools for forecast evaluation are static, in the sense that statistical tests for forecast calibration are only valid if the evaluation period is fixed in advance. Recently, e-values have been introduced as a new, dynamic method for assessing statistical significance. An e-value is a non-negative random variable with expected value at most one under a null hypothesis. Large e-values give evidence against the null hypothesis, and the multiplicative inverse of an e-value is a conservative p-value. E-values are particularly suitable for sequential forecast evaluation, since they naturally lead to statistical tests which are valid under optional stopping. This article proposes e-values for testing probabilistic calibration of forecasts, which is one of the most important notions of calibration. The proposed methods are also more generally applicable for sequential goodness-of-fit testing. We demonstrate that the e-values are competitive in terms of power when compared to extant methods, which do not allow sequential testing. Furthermore, they provide important and useful insights in the evaluation of probabilistic weather forecasts.
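To illustrate how e-values lead to a test that stays valid under optional stopping, the following is a minimal sketch and not the construction proposed in the article. It assumes that, under the null hypothesis of probabilistic calibration, the probability integral transform (PIT) values of the forecasts are independent and uniform on (0, 1). Any probability density on (0, 1) evaluated at a PIT value then has expected value one under the null and is therefore an e-value. The running product of these e-values can be monitored after every observation, and the null is rejected at level alpha as soon as the product reaches 1/alpha. The fixed Beta(0.5, 0.5) alternative, which targets U-shaped PIT histograms, and the names `beta_pdf` and `sequential_e_test` are assumptions made purely for illustration.

```python
import math
import random


def beta_pdf(u, a, b):
    """Density of the Beta(a, b) distribution at u in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(u) + (b - 1) * math.log(1 - u))


def sequential_e_test(pit_values, alpha=0.05, a=0.5, b=0.5):
    """Monitor the running product of e-values and stop at the first rejection.

    Each factor is the Beta(a, b) density at a PIT value (an illustrative,
    assumed alternative). Returns (rejected, stopping_index, e_process),
    where stopping_index is 1-based and None if the null is never rejected.
    """
    threshold = 1.0 / alpha
    e_process = []
    running = 1.0
    for t, u in enumerate(pit_values, start=1):
        u = min(max(u, 1e-12), 1 - 1e-12)  # numerical guard against log(0)
        running *= beta_pdf(u, a, b)
        e_process.append(running)
        if running >= threshold:
            return True, t, e_process
    return False, None, e_process


if __name__ == "__main__":
    rng = random.Random(1)
    # Simulated PIT values: uniform (calibrated) and Beta(0.5, 0.5)
    # (U-shaped, as produced by underdispersed forecasts).
    calibrated = [rng.random() for _ in range(500)]
    underdispersed = [rng.betavariate(0.5, 0.5) for _ in range(500)]
    for name, pits in [("calibrated", calibrated), ("underdispersed", underdispersed)]:
        rejected, t, path = sequential_e_test(pits)
        print(name, "rejected:", rejected, "at step:", t, "final e-value:", path[-1])
```

Because the decision rule only compares the running product with 1/alpha, the evaluation period does not have to be fixed in advance: monitoring can stop at any time, and the inverse of the current product, capped at one, can be reported as a conservative p-value. A fixed alternative has power only against the kind of miscalibration it was chosen for, so a practical method would need a more adaptive choice.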
