Forecasts call for a rating that reflects their quality in the context of what is possible in theory and what is reasonable to expect in practice. Granular forecasts in the regime of low count rates - as they often occur in retail, where an intermittent demand of a handful of units per product, day, and location may be observed - are dominated by the inevitable statistical uncertainty of the Poisson distribution. This makes it hard to judge whether a given metric value is dominated by Poisson noise or truly indicates a bad prediction model. To make things worse, every evaluation metric suffers from scaling: its value is mostly determined by the predicted selling rate and the resulting rate-dependent Poisson noise, and only secondarily by the quality of the forecast. For any metric, comparing two groups of forecasted products therefore often yields "the slow movers are performing worse than the fast movers" or vice versa - the na\"ive scaling trap. To distill the intrinsic quality of a forecast, we stratify predictions into buckets of approximately equal predicted rate and evaluate metrics for each bucket separately. By comparing the achieved value per bucket to benchmarks, we obtain a scaling-aware rating of count forecasts. Our procedure avoids the na\"ive scaling trap, provides an immediate, intuitive judgment of forecast quality, and allows comparing forecasts across different products or even industries.
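A minimal sketch of how such a per-bucket, scaling-aware evaluation might look in practice (illustrative only: the function names, the choice of MAE as the metric, and the simulated ideal-Poisson benchmark are assumptions, not the paper's reference implementation):

```python
import numpy as np

def poisson_mae_benchmark(rates, n_sim=10_000, seed=0):
    """Assumed benchmark: the MAE an ideal forecaster would achieve if the
    predicted rates were the true Poisson rates, i.e. pure Poisson noise."""
    rng = np.random.default_rng(seed)
    draws = rng.poisson(lam=rates, size=(n_sim, len(rates)))
    return np.abs(draws - rates).mean()

def scaling_aware_rating(y_true, y_pred, n_buckets=5):
    """Stratify forecasts into buckets of approximately equal predicted rate,
    then compare the achieved MAE per bucket to the ideal-Poisson benchmark."""
    order = np.argsort(y_pred)
    report = []
    for bucket in np.array_split(order, n_buckets):
        rates = y_pred[bucket]
        mae = np.abs(y_true[bucket] - rates).mean()
        benchmark = poisson_mae_benchmark(rates)
        # A ratio near 1 means the error is dominated by Poisson noise;
        # ratios well above 1 point to genuine model deficiencies.
        report.append((rates.mean(), mae, benchmark, mae / benchmark))
    return report

# Toy usage: rates spanning slow and fast movers, observations drawn from
# the true rates, so every bucket should score close to its benchmark.
rng = np.random.default_rng(42)
true_rates = rng.uniform(0.1, 20.0, size=5_000)
y_true = rng.poisson(true_rates)
for mean_rate, mae, bench, ratio in scaling_aware_rating(y_true, true_rates):
    print(f"rate~{mean_rate:6.2f}  MAE={mae:.3f}  benchmark={bench:.3f}  ratio={ratio:.2f}")
```

Equal-population buckets sorted by predicted rate are one simple way to obtain "approximately equal rate" strata; fixed rate boundaries would serve the same purpose. The per-bucket ratio to the benchmark is what removes the rate dependence and avoids the na\"ive scaling trap described above.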