Multivariate probabilistic time series forecasts are commonly evaluated via proper scoring rules, i.e., functions that are minimal in expectation for the ground-truth distribution. However, this property is not sufficient to guarantee good discrimination in the non-asymptotic regime. In this paper, we provide the first systematic finite-sample study of proper scoring rules for time-series forecasting evaluation. Through a power analysis, we identify the "region of reliability" of a scoring rule, i.e., the set of practical conditions where it can be relied on to identify forecasting errors. We carry out our analysis on a comprehensive synthetic benchmark, specifically designed to test several key discrepancies between ground-truth and forecast distributions, and we gauge the generalizability of our findings to real-world tasks with an application to an electricity production problem. Our results reveal critical shortcomings in the evaluation of multivariate probabilistic forecasts as commonly performed in the literature.
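For concreteness, the propriety condition referenced above can be stated as follows (the standard definition; the notation here is illustrative rather than taken from this paper): a scoring rule $S$ is proper if, for the ground-truth distribution $P$ and any forecast distribution $Q$,
\[
\mathbb{E}_{Y \sim P}\left[ S(P, Y) \right] \;\le\; \mathbb{E}_{Y \sim P}\left[ S(Q, Y) \right] \quad \text{for all } Q,
\]
and strictly proper if equality holds only when $Q = P$. The point made above is that this expectation-level guarantee need not yield reliable discrimination once scores are estimated from finitely many observations.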