Consider two or more forecasters, each making a sequence of predictions for different events over time. We ask a relatively basic question: how might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts or outcomes were generated? This work presents a novel and rigorous answer to this question. We design a sequential inference procedure for estimating the time-varying difference in forecast quality as measured by any scoring rule. The resulting confidence intervals are nonasymptotically valid and can be continuously monitored to yield statistically valid comparisons at arbitrary data-dependent stopping times ("anytime-valid"); this is enabled by adapting variance-adaptive supermartingales, confidence sequences, and e-processes to our setting. Motivated by Shafer and Vovk's game-theoretic probability, our coverage guarantees are also distribution-free, in the sense that they make no distributional assumptions on the forecasts or outcomes. In contrast to recent work by Henzi and Ziegel, our tools can sequentially test a weak null hypothesis about whether one forecaster outperforms another on average over time. We demonstrate their effectiveness by comparing probability forecasts on Major League Baseball (MLB) games and statistical postprocessing methods for ensemble weather forecasts.
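To make the idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of an anytime-valid confidence sequence for the time-varying average score differential between two probability forecasters. It uses the Brier score and a simple Hoeffding-type normal-mixture supermartingale boundary rather than the variance-adaptive (empirical-Bernstein or betting) supermartingales described above; the function names, the mixing parameter `tau`, and the simulated forecasters are illustrative assumptions.

```python
# Illustrative sketch only: an anytime-valid confidence sequence (CS) for the running
# average score differential between two forecasters, under a bounded scoring rule.
import numpy as np


def brier_score(p, y):
    """Brier score of a probability forecast p for a binary outcome y (lower is better)."""
    return (p - y) ** 2


def comparison_cs(p, q, y, alpha=0.05, tau=1.0):
    """(1 - alpha) confidence sequence, valid uniformly over time, for
    Delta_t = (1/t) * sum_{i<=t} E[brier(q_i, y_i) - brier(p_i, y_i) | past],
    the time-varying average Brier advantage of forecaster p over forecaster q.
    `tau` is a tuning parameter (mixing scale of the normal-mixture supermartingale);
    it affects tightness, not validity."""
    p, q, y = map(np.asarray, (p, q, y))
    t = np.arange(1, len(y) + 1)
    # Score differentials lie in [-1, 1], hence are 1-sub-Gaussian by Hoeffding's lemma.
    deltas = brier_score(q, y) - brier_score(p, y)
    partial_sums = np.cumsum(deltas)
    v = t * 1.0  # accumulated sub-Gaussian variance proxy (sigma^2 = 1 per step)
    # Ville's inequality applied to the mixed supermartingale gives this time-uniform radius.
    radius = np.sqrt(2 * (v + 1 / tau**2) * np.log(np.sqrt(tau**2 * v + 1) / alpha))
    center = partial_sums / t
    return center - radius / t, center + radius / t


if __name__ == "__main__":
    # Simulated comparison: forecaster p tracks the latent probabilities more closely than q.
    rng = np.random.default_rng(0)
    T = 20000
    truth = rng.uniform(0.2, 0.8, size=T)                 # latent event probabilities
    y = rng.binomial(1, truth)                            # observed binary outcomes
    p = np.clip(truth + rng.normal(0, 0.05, T), 0, 1)     # sharper forecaster
    q = np.clip(truth + rng.normal(0, 0.20, T), 0, 1)     # noisier forecaster
    lo, hi = comparison_cs(p, q, y)
    # A lower bound above zero at any (data-dependent) stopping time indicates that
    # p has outperformed q on average so far.
    print(f"CS at t={T}: [{lo[-1]:.4f}, {hi[-1]:.4f}]")
```

Because the confidence sequence holds uniformly over time, one may monitor `lo` and `hi` after every game or forecast cycle and stop whenever the interval excludes zero, without invalidating the coverage guarantee.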