In large-scale time series forecasting, one often encounters the situation where the temporal patterns of time series, while drifting over time, differ from one another within the same dataset. In this paper, we provably show that under such heterogeneity, training a forecasting model with commonly used stochastic optimizers (e.g., SGD) potentially suffers from large variance in gradient estimation, and thus incurs long training times. We show that this issue can be efficiently alleviated via stratification, which allows the optimizer to sample from pre-grouped time series strata. To better trade off gradient variance against computational complexity, we further propose SCott (Stochastic Stratified Control Variate Gradient Descent), a variance-reduced SGD-style optimizer that utilizes stratified sampling via control variates. In theory, we provide a convergence guarantee for SCott on smooth non-convex objectives. Empirically, we evaluate SCott and other baseline optimizers on both synthetic and real-world time series forecasting problems, and demonstrate that SCott converges faster with respect to both the number of iterations and wall clock time.
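To make the idea of stratified sampling with control variates concrete, here is a minimal sketch (not the authors' implementation) of an SVRG-style, SCott-like gradient estimator on a toy least-squares problem. The strata, helper names (scott_like_step, stratum_mean_grad), step size, and anchor-refresh schedule are all illustrative assumptions; only the general recipe of per-stratum control variates combined with stratified weighting follows the abstract.

```python
import numpy as np

# Minimal sketch (not the paper's code): stratified control-variate gradient
# estimation for least squares, f(w) = (1/N) * sum_i (x_i . w - y_i)^2.
# Assumed setup: strata are pre-grouped index sets; per-stratum "anchor"
# gradients are refreshed periodically, as in SVRG-style variance reduction.

rng = np.random.default_rng(0)


def grad_i(w, X, y, i):
    """Per-example gradient of the squared error at index i."""
    return 2.0 * (X[i] @ w - y[i]) * X[i]


def stratum_mean_grad(w, X, y, idx):
    """Full gradient over one stratum (used as the control-variate mean)."""
    r = X[idx] @ w - y[idx]
    return 2.0 * (X[idx] * r[:, None]).mean(axis=0)


def scott_like_step(w, w_anchor, anchor_means, X, y, strata, lr):
    """One SGD-style step with stratified sampling plus control variates.

    For each stratum S_k, sample one index i_k and form
        g_k = grad_i(w, i_k) - grad_i(w_anchor, i_k) + anchor_means[k],
    then average the g_k weighted by stratum sizes |S_k| / N.
    """
    N = sum(len(s) for s in strata)
    g = np.zeros_like(w)
    for k, idx in enumerate(strata):
        i = rng.choice(idx)
        g_k = grad_i(w, X, y, i) - grad_i(w_anchor, X, y, i) + anchor_means[k]
        g += (len(idx) / N) * g_k
    return w - lr * g


# Toy data: two strata with different "temporal patterns" (different slopes).
X = rng.normal(size=(200, 5))
w_true_a, w_true_b = rng.normal(size=5), rng.normal(size=5)
y = np.concatenate([X[:100] @ w_true_a, X[100:] @ w_true_b])
strata = [np.arange(100), np.arange(100, 200)]

w = np.zeros(5)
for epoch in range(50):
    # Refresh the anchor point and per-stratum anchor gradients periodically.
    w_anchor = w.copy()
    anchor_means = [stratum_mean_grad(w_anchor, X, y, idx) for idx in strata]
    for _ in range(20):
        w = scott_like_step(w, w_anchor, anchor_means, X, y, strata, lr=0.05)

print(f"final training loss: {np.mean((X @ w - y) ** 2):.4f}")
```

Each per-stratum estimate g_k is unbiased for that stratum's gradient at w, while the control-variate term cancels much of the sampling noise near the anchor point; combining the strata with weights |S_k| / N recovers an unbiased estimate of the full-dataset gradient with lower variance than plain uniform SGD sampling under heterogeneity.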