Forecast evaluation plays a key role in how empirical evidence shapes the development of the discipline. Domain experts are interested in error measures relevant to their decision-making needs, but such measures may produce unreliable results. Although the reliability properties of several measures have already been discussed, reliability has hardly been quantified in an objective way. We propose a measure named Rank Stability, which evaluates how much the rankings of an experiment differ between similar datasets when the models and error measures are held constant. We use it to study the evaluation setup of the M5 competition. We find that the M5 evaluation setup is less reliable than alternative error measures. The main drivers of instability are hierarchical aggregation and scaling. Price-weighting reduces the stability of all tested error measures. The scale normalization of the M5 error measure results in less stability than other scale-free errors. Hierarchical levels taken separately are less stable at higher levels of aggregation, and their combination is even less stable than the individual levels. We also show favorable trade-offs that retain the importance of aggregation without affecting stability. Aggregation and stability can be linked to the influence of much-debated magic numbers. Many of our findings apply to hierarchical forecast benchmarking in general.
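To make the idea of Rank Stability concrete, the following is a minimal sketch, assuming the measure is summarized as the average pairwise rank correlation (here Kendall's tau) between model rankings obtained on similar datasets with a fixed error measure. The exact formulation used in the paper may differ; the function name `rank_stability` and the toy error table are purely illustrative.

```python
import itertools
import numpy as np
from scipy.stats import kendalltau

def rank_stability(error_table: np.ndarray) -> float:
    """Illustrative rank-stability score (assumed formulation).

    error_table has shape (n_datasets, n_models): the error of each model
    (columns) on each of several similar datasets (rows), all computed with
    the same error measure.

    Stability is summarized as the mean pairwise Kendall's tau between the
    per-dataset rankings: 1.0 means identical rankings on every dataset,
    lower values mean less stable rankings.
    """
    # Rank models within each dataset (lower error -> better rank).
    rankings = error_table.argsort(axis=1).argsort(axis=1)

    # Average rank correlation over all pairs of datasets.
    taus = [
        kendalltau(rankings[i], rankings[j])[0]
        for i, j in itertools.combinations(range(len(rankings)), 2)
    ]
    return float(np.mean(taus))

# Toy example: 4 similar datasets, 3 models, one fixed error measure.
errors = np.array([
    [0.71, 0.65, 0.80],
    [0.69, 0.66, 0.78],
    [0.70, 0.64, 0.81],
    [0.72, 0.63, 0.79],
])
print(f"rank stability: {rank_stability(errors):.2f}")
```

Under this reading, comparing the score across error measures (e.g., with and without price-weighting or the M5 scale normalization) would indicate which evaluation setup yields more reproducible rankings.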