Comparing Sequential Forecasters

We consider two or more forecasters, each making a sequence of predictions over time, and tackle the problem of how to compare them, either online or post hoc. In fields ranging from meteorology to sports, forecasters make predictions on different events or quantities over time, and this work describes how to compare them in a statistically rigorous manner. Specifically, we design a nonasymptotic sequential inference procedure for estimating the time-varying difference in forecast quality when using a relatively large class of scoring rules (bounded scores with a linear equivalent). The resulting confidence intervals can be continuously monitored and yield statistically valid comparisons at arbitrary data-dependent stopping times ("anytime-valid"); this is enabled by adapting recent variance-adaptive confidence sequences (CSs) to our setting. In the spirit of Shafer and Vovk's game-theoretic probability, the coverage guarantees for our CSs are also distribution-free, in the sense that they make no distributional assumptions whatsoever on the forecasts or outcomes. Additionally, in contrast to a recent preprint by Henzi and Ziegel, we show how to sequentially test a weak null hypothesis about whether one forecaster outperforms another on average over time, by designing different e-processes that quantify the evidence at any stopping time. We examine the validity of our methods relative to their fixed-time and asymptotic counterparts in synthetic experiments and demonstrate their effectiveness in real-data settings, including comparing probability forecasts on Major League Baseball (MLB) games and comparing statistical postprocessing methods for ensemble weather forecasts.
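
To make the idea of an anytime-valid interval concrete, the following is a minimal sketch of a confidence sequence for the running average score difference between two probability forecasters. It is not the variance-adaptive construction the abstract refers to: it swaps in a simpler fixed-parameter Hoeffding-style boundary, which only needs the per-round score differences to lie in [-1, 1]. The names `brier` and `hoeffding_cs`, the tuning parameter `lam`, and the sign convention (positive differences favor the first forecaster) are illustrative assumptions, not taken from the paper.

```python
import math


def brier(p, y):
    """Brier loss of a probability forecast p for a binary outcome y (lower is better)."""
    return (p - y) ** 2


def hoeffding_cs(deltas, alpha=0.05, lam=0.1):
    """Hoeffding-style confidence sequence for the average expected score difference.

    Assumption: every delta lies in [-1, 1], so by Hoeffding's lemma
    exp(lam * sum(delta_i - mu_i) - t * lam**2 / 2) is a nonnegative
    supermartingale, and Ville's inequality gives coverage at all times
    simultaneously. `lam` is a hypothetical fixed tuning parameter.

    Returns a list of (t, lower, upper) tuples, one per round.
    """
    bounds = []
    total = 0.0
    for t, d in enumerate(deltas, start=1):
        total += d
        mean = total / t
        # Two-sided boundary: alpha / 2 is spent on each tail.
        radius = (math.log(2.0 / alpha) / lam + t * lam / 2.0) / t
        # The target itself lies in [-1, 1], so clipping preserves coverage.
        bounds.append((t, max(mean - radius, -1.0), min(mean + radius, 1.0)))
    return bounds


# Toy data: forecaster p is sharp and calibrated, forecaster q always says 0.5.
outcomes = [i % 2 for i in range(2000)]
p = [0.9 if y == 1 else 0.1 for y in outcomes]
q = [0.5] * len(outcomes)

# Positive delta means p incurred a smaller Brier loss than q in that round.
deltas = [brier(qi, y) - brier(pi, y) for pi, qi, y in zip(p, q, outcomes)]
cs = hoeffding_cs(deltas)
t, lower, upper = cs[-1]
print(f"t={t}: average score difference in [{lower:.3f}, {upper:.3f}]")
```

Because the guarantee holds over all rounds at once, the interval can be inspected after every round and the comparison stopped as soon as the lower bound exceeds zero. One limitation of this simplified boundary is that its radius approaches `lam / 2` rather than zero as the number of rounds grows, which is one reason to prefer more refined, variance-adaptive boundaries in practice.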
