Consider two forecasters, each making a single prediction for a sequence of events over time. We ask a relatively basic question: how might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts and outcomes were generated? In this paper, we present a rigorous answer to this question by designing novel sequential inference procedures for estimating the time-varying difference in forecast scores. To do this, we employ confidence sequences (CS), which are sequences of confidence intervals that can be continuously monitored and are valid at arbitrary data-dependent stopping times ("anytime-valid"). The widths of our CSs are adaptive to the underlying variance of the score differences. Underlying their construction is a game-theoretic statistical framework, in which we further identify e-processes and p-processes for sequentially testing a weak null hypothesis -- whether one forecaster outperforms another on average (rather than always). Our methods do not make distributional assumptions on the forecasts or outcomes; our main theorems apply to any bounded scores, and we later provide alternative methods for unbounded scores. We empirically validate our approaches by comparing real-world baseball and weather forecasters.
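As a concrete illustration of the abstract's two main objects, the sketch below implements a simplified Hoeffding-style (fixed-lambda) confidence sequence for the running average score difference between two forecasters with bounded scores, together with the corresponding e-process for the weak null that the first forecaster does not outperform the second on average. This is not the paper's exact construction, which uses tighter variance-adaptive (empirical-Bernstein) bounds; the function name, the tuning parameter `lam`, and the default interval width are illustrative assumptions.

```python
import numpy as np

def hoeffding_cs_and_eprocess(score_diffs, alpha=0.05, lam=0.5, width=2.0):
    """
    Anytime-valid inference for the running average score difference
    Delta_t = (1/t) * sum_{i<=t} E[d_i | past], given observed score
    differences d_i lying in an interval of length `width` (e.g., [-1, 1]).

    Returns, for every time t:
      - a (1 - alpha) Hoeffding-style confidence sequence for Delta_t, and
      - an e-process for the weak null H0: Delta_t <= 0 for all t.

    Simplified fixed-lambda sketch: the interval width does not shrink
    to zero; the paper's empirical-Bernstein CSs are tighter and adapt
    to the variance of the score differences.
    """
    d = np.asarray(score_diffs, dtype=float)
    t = np.arange(1, len(d) + 1)
    running_mean = np.cumsum(d) / t

    # Ville's inequality applied (two-sided, alpha/2 each) to the
    # Hoeffding supermartingale
    #   exp(lam * sum(d_i - delta_i) - t * lam^2 * width^2 / 8)
    # yields a confidence sequence valid uniformly over all t:
    margin = np.log(2.0 / alpha) / (lam * t) + lam * width**2 / 8.0
    lower, upper = running_mean - margin, running_mean + margin

    # Under the weak null, the cumulative conditional means are <= 0,
    # so the process below is dominated by a nonnegative supermartingale
    # starting at 1 and is therefore a valid e-process: large values are
    # evidence against H0, and 1/e_t is an anytime-valid p-value.
    log_e = lam * np.cumsum(d) - t * lam**2 * width**2 / 8.0
    return lower, upper, np.exp(log_e)
```

For example, with d_i = s(p_i, y_i) - s(q_i, y_i) for a bounded scoring rule s such as the Brier score, the returned (lower, upper) band can be monitored continuously and remains valid at any data-dependent stopping time, while the e-process tracks accumulated evidence that the first forecaster outperforms the second on average.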