In real-world applications of large-scale time series, one often encounters the situation where the temporal patterns of time series, while drifting over time, differ from one another within the same dataset. In this paper, we provably show that, under such heterogeneity, training a forecasting model with commonly used stochastic optimizers (e.g., SGD) potentially suffers from large gradient variance and thus requires a long training time. To alleviate this issue, we propose a sampling strategy named Subgroup Sampling, which mitigates the large variance by sampling over pre-grouped time series. We further introduce SCott, a variance-reduced SGD-style optimizer that co-designs subgroup sampling with the control variate method. In theory, we provide a convergence guarantee for SCott on smooth non-convex objectives. Empirically, we evaluate SCott and other baseline optimizers on both synthetic and real-world time series forecasting problems, and show that SCott converges faster with respect to both the number of iterations and wall clock time. Additionally, we present two SCott variants that speed up Adam and Adagrad without compromising the generalization of forecasting models.
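To make the idea concrete, the following is a minimal, hypothetical NumPy sketch of how subgroup sampling can be combined with an SVRG-style control variate; it is not the paper's implementation. The synthetic two-group data, the least-squares forecasting objective, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch (not the authors' code): subgroup sampling plus an
# SVRG-style control variate on a toy least-squares forecasting objective.
import numpy as np

rng = np.random.default_rng(0)

def make_group(slope, n=200, d=5):
    # One "subgroup" of series with its own temporal pattern, flattened
    # into (feature, target) pairs for a linear forecaster.
    X = rng.normal(size=(n, d))
    y = X @ (slope * np.ones(d)) + 0.1 * rng.normal(size=n)
    return X, y

# Pre-grouped data: two subgroups with heterogeneous patterns (assumed).
groups = [make_group(1.0), make_group(-1.0)]

def grad(w, X, y):
    # Gradient of the least-squares loss 0.5 * ||Xw - y||^2 / n.
    return X.T @ (X @ w - y) / len(y)

w, lr = np.zeros(5), 0.05
for epoch in range(20):
    w_snap = w.copy()                                   # snapshot point
    anchor = [grad(w_snap, X, y) for X, y in groups]    # per-subgroup anchor gradients
    for _ in range(50):
        k = rng.integers(len(groups))                   # sample a subgroup
        X, y = groups[k]
        i = rng.integers(len(y), size=16)               # minibatch within that subgroup
        # Control-variate estimator: the anchor term keeps the estimate
        # unbiased for the full gradient while damping the variance that
        # comes from minibatch noise and cross-subgroup heterogeneity.
        v = grad(w, X[i], y[i]) - grad(w_snap, X[i], y[i]) + anchor[k]
        w -= lr * v
print("estimated weights:", w)
```

Under these assumptions the estimator `v` is unbiased for the full gradient (each subgroup contributes its own anchor, and the anchors average to the snapshot gradient), which is the general mechanism the abstract refers to; the actual SCott update and its variance analysis are given in the paper itself.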