Forecasts crave a rating that reflects their quality in the context of what is possible in theory and what is reasonable to expect in practice. Granular forecasts in the regime of low count rates (as they often occur in retail, where an intermittent demand of only a handful of units might be observed per product, day, and location) are dominated by the inevitable statistical uncertainty of the Poisson distribution. This makes it hard to judge whether a given metric value merely reflects Poisson noise or truly indicates a bad prediction model. To make things worse, every evaluation metric suffers from scaling: its value is mostly determined by the predicted selling rate and the resulting rate-dependent Poisson noise, and only secondarily by the quality of the forecast. For any metric, comparing two groups of forecasted products often yields "the slow movers are performing worse than the fast movers" or vice versa: the naïve scaling trap. To distill the intrinsic quality of a forecast, we stratify predictions into buckets of approximately equal rate and evaluate metrics for each bucket separately. By comparing the achieved value per bucket to benchmarks, we obtain a scaling-aware rating of count forecasts. Our procedure avoids the naïve scaling trap, provides an immediate intuitive judgment of forecast quality, and allows forecasts for different products or even industries to be compared.
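The stratification idea can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the procedure of the paper itself: the metric is the mean absolute error (MAE), buckets are formed from quantiles of the predicted rate, the benchmark is the MAE an ideal forecast would reach if actuals were Poisson-distributed around the predicted rate (estimated by simulation), and the names `bucket_by_rate`, `mae_per_bucket`, and `poisson_benchmark_mae` are hypothetical.

```python
import numpy as np


def bucket_by_rate(predicted, n_buckets):
    """Assign each forecast to a quantile bucket of similar predicted rate."""
    edges = np.quantile(predicted, np.linspace(0.0, 1.0, n_buckets + 1))
    # Interior edges only, so the bucket indices run from 0 to n_buckets - 1.
    return np.digitize(predicted, edges[1:-1])


def mae_per_bucket(predicted, actual, buckets, n_buckets):
    """Achieved mean absolute error, evaluated separately in each bucket."""
    abs_err = np.abs(actual - predicted)
    return np.array([abs_err[buckets == b].mean() for b in range(n_buckets)])


def poisson_benchmark_mae(predicted, buckets, n_buckets, rng, n_draws=100):
    """Benchmark (an assumption of this sketch): the MAE expected if the
    actuals were Poisson draws with mean equal to the predicted rate."""
    simulated = rng.poisson(predicted, size=(n_draws, predicted.size))
    abs_err = np.abs(simulated - predicted).mean(axis=0)
    return np.array([abs_err[buckets == b].mean() for b in range(n_buckets)])


# Synthetic data: low, heterogeneous selling rates with Poisson demand.
rng = np.random.default_rng(0)
n_buckets = 5
true_rate = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)
actual = rng.poisson(true_rate)

# An ideal forecast (predicts the true rate) and one that under-forecasts by half.
forecasts = {"ideal": true_rate, "under": 0.5 * true_rate}

raw_mae = {}
rating = {}
for name, predicted in forecasts.items():
    buckets = bucket_by_rate(predicted, n_buckets)
    raw_mae[name] = mae_per_bucket(predicted, actual, buckets, n_buckets)
    benchmark = poisson_benchmark_mae(predicted, buckets, n_buckets, rng)
    # Ratio near 1: as good as Poisson noise allows; well above 1: worse.
    rating[name] = raw_mae[name] / benchmark
    print(name, "raw MAE per bucket:  ", np.round(raw_mae[name], 2))
    print(name, "achieved / benchmark:", np.round(rating[name], 2))
```

For the ideal forecast, the raw MAE grows from the slowest to the fastest bucket even though the forecast is equally good everywhere, which is the scaling effect described above, while the ratio to the benchmark stays close to 1 in every bucket. Note that MAE is only one possible choice of metric here; the same bucketing can be applied to any other.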