Forecast quality should be assessed in the context of what is possible in theory and what is reasonable to expect in practice. Often, one can identify an approximate upper bound to a probabilistic forecast's sharpness, which sets a lower, not necessarily achievable, limit to error metrics. In retail forecasting, a simple but often unsurpassable sharpness limit is given by the Poisson distribution. When evaluating forecasts using traditional metrics such as Mean Absolute Error, it is hard to judge whether a certain achieved value reflects unavoidable Poisson noise or truly indicates an over-dispersed prediction model. Moreover, every evaluation metric suffers from precision scaling: the metric's value is mostly determined by the selling rate and by the resulting rate-dependent Poisson noise, and only secondarily by the forecast quality. Comparing two groups of forecasted products often yields the conclusion that "the slow movers are performing worse than the fast movers" or vice versa, which we call the naïve scaling trap. To distill the intrinsic quality of a forecast, we stratify predictions into buckets of approximately equal predicted values and evaluate metrics separately per bucket. By comparing the achieved value per bucket to benchmarks defined by the theoretical expectation value of the metric, we obtain an intuitive visualization of forecast quality. This representation can be summarized by a single rating that makes forecast quality comparable among different products or even industries. The resulting scaling-aware forecast rating is applied to forecasting models used on the M5 competition dataset as well as to real-life forecasts provided by Blue Yonder's Demand Edge for Retail solution for grocery products in Sainsbury's supermarkets in the United Kingdom. The results permit a clear interpretation and high-level understanding of model quality by non-experts.
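The bucketed comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the equal-count bucketing of sorted predictions, the use of the predicted mean as the point forecast, and the plain achieved-to-benchmark ratio are all assumptions made for illustration, and the ratio is not the rating defined in the paper. The sketch computes the Mean Absolute Error per bucket and compares it with the value expected if sales were Poisson-distributed around each predicted rate.

```python
import math

import numpy as np


def poisson_pmf(k, mu):
    """Poisson probability P(Y = k) for rate mu > 0, computed in log space."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def expected_abs_error_poisson(mu, point=None):
    """E|Y - point| for Y ~ Poisson(mu); point defaults to mu (an assumption)."""
    if point is None:
        point = mu
    if mu <= 0:
        return abs(point)
    # Truncate the sum far into the upper tail, where the mass is negligible.
    k_max = int(mu + 10 * math.sqrt(mu) + 20)
    return sum(abs(k - point) * poisson_pmf(k, mu) for k in range(k_max + 1))


def bucketed_mae_vs_poisson(predicted, actual, n_buckets=5):
    """Compare the achieved MAE with the Poisson benchmark per prediction bucket.

    Buckets hold roughly equal numbers of predictions after sorting by
    predicted value (hypothetical choice; other bucketing schemes are possible).
    """
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    order = np.argsort(predicted)
    rows = []
    for idx in np.array_split(order, n_buckets):
        if idx.size == 0:
            continue
        mu = predicted[idx]
        achieved = float(np.mean(np.abs(actual[idx] - mu)))
        benchmark = float(np.mean([expected_abs_error_poisson(m) for m in mu]))
        rows.append({
            "mean_prediction": float(mu.mean()),
            "achieved_mae": achieved,
            "poisson_mae": benchmark,
            # Ratio near 1: error is consistent with pure Poisson noise.
            "ratio": achieved / benchmark if benchmark > 0 else float("nan"),
        })
    return rows


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rates = rng.uniform(0.5, 20.0, size=20_000)  # synthetic predicted rates
    sales = rng.poisson(rates)                   # ideal Poisson outcomes
    for row in bucketed_mae_vs_poisson(rates, sales):
        print(row)
```

On synthetic data where the outcomes really are Poisson draws around the predicted rates, the achieved MAE grows with the selling rate while the ratio stays close to 1 in every bucket, which is the effect the abstract calls precision scaling. Ratios well above 1 would point to over-dispersion or other forecast deficiencies rather than unavoidable noise.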