Time series forecasting has been a quintessential problem in data science for decades, with applications ranging from astronomy to zoology. In practice, a long time series may not be necessary to achieve a desired level of prediction accuracy. This work addresses the following fundamental question: how much recent historical data is required to achieve a targeted percentage of the statistical prediction efficiency attained with the full time series? To answer it, the sequential back subsampling (SBS) method, a novel dual-efficient forecasting framework, is proposed. It estimates the percentage of the most recent historical data needed to gain computational efficiency (via subsampling) while maintaining a desired level of prediction accuracy (almost as good as that obtained from the full data). Theoretical justification is provided using asymptotic prediction theory for traditional autoregressive (AR) models. The framework is also shown to work for recent machine learning forecasting methods, even when the models may be misspecified, with empirical illustrations on simulated data and on applications to financial stock prices and COVID-19 data. The main conclusion is that, for a broad class of forecasting methods, only a fraction of the most recent historical data yields near-optimal, or even better, practically relevant predictive accuracy.
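The question posed in the abstract can be illustrated with a small numerical sketch. The code below is not the authors' SBS procedure, whose details are not given here. It is a simplified holdout-based illustration under stated assumptions: an AR(p) model fitted by least squares, prediction efficiency defined as the ratio of the full-data one-step mean squared error to the subsample one-step mean squared error, a 95% target, and a fixed grid of candidate fractions. All function names (`simulate_ar1`, `fit_ar`, `one_step_mse`, `smallest_recent_fraction`) are hypothetical.

```python
import numpy as np


def simulate_ar1(n, phi=0.6, seed=0):
    """Simulate a zero-mean AR(1) series with standard normal innovations."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + eps[t]
    return y


def _lag_matrix(series, p):
    """Design matrix whose rows are [1, y[t-1], ..., y[t-p]] for t = p..len-1."""
    n = len(series)
    cols = [np.ones(n - p)] + [series[p - k:n - k] for k in range(1, p + 1)]
    return np.column_stack(cols)


def fit_ar(y, p):
    """Least-squares fit of an AR(p) model with intercept."""
    coef, *_ = np.linalg.lstsq(_lag_matrix(y, p), y[p:], rcond=None)
    return coef


def one_step_mse(coef, history, future):
    """One-step-ahead mean squared error on `future`, coefficients held fixed."""
    p = len(coef) - 1
    series = np.concatenate([history[-p:], future])
    preds = _lag_matrix(series, p) @ coef
    return float(np.mean((future - preds) ** 2))


def smallest_recent_fraction(train, test, p=1, target=0.95, fractions=None):
    """Return the smallest candidate fraction of the most recent training data
    whose efficiency (assumed here: MSE_full / MSE_subsample) meets `target`,
    together with the (fraction, efficiency) pairs for the whole grid."""
    if fractions is None:
        fractions = np.linspace(0.1, 1.0, 10)  # assumed grid of candidates
    mse_full = one_step_mse(fit_ar(train, p), train, test)
    results = []
    for frac in fractions:
        # Keep enough points to estimate p + 1 coefficients (ad hoc floor).
        m = min(len(train), max(int(round(frac * len(train))), 5 * (p + 1)))
        recent = train[-m:]  # most recent m observations only
        mse_sub = one_step_mse(fit_ar(recent, p), train, test)
        results.append((float(frac), mse_full / mse_sub))
    for frac, eff in results:
        if eff >= target:
            return frac, results
    return 1.0, results


if __name__ == "__main__":
    y = simulate_ar1(2500, phi=0.6, seed=1)
    train, test = y[:2000], y[2000:]
    frac, results = smallest_recent_fraction(train, test, p=1, target=0.95)
    for f, eff in results:
        print(f"fraction={f:.1f}  efficiency={eff:.3f}")
    print("smallest fraction meeting the target:", frac)
```

In this sketch the efficiency at fraction 1.0 equals 1 by construction, since the subsample is then the full training series. A single holdout split gives a noisy efficiency estimate, so values above 1 can occur for small fractions; averaging over several splits or simulated replications would be a natural refinement.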