Interactive response times are essential in analytical pipelines: they let users explore enough possibilities to make informed business decisions. We consider a forecasting pipeline over large volumes of high-dimensional time series data. Real-time forecasting can be conducted in two steps. First, we specify the portion of the data to focus on and the measure to be predicted by slicing, dicing, and aggregating the data. Second, a forecasting model is trained on the aggregated results to predict the trend of the specified measure. Although many forecasting models are available, the first step is the performance bottleneck. A natural idea is to use sampling to obtain approximate aggregations in real time as the input for training the forecasting model. Our scalable real-time forecasting system, FlashP (Flash Prediction), is built on this idea, and this paper addresses two major challenges: first, how approximate aggregations affect the fitting of forecasting models and the resulting forecasts; and second, which sampling algorithms should be used to obtain these approximate aggregations and how large the samples should be. We introduce a new sampling scheme, called GSW sampling, and analyze error bounds for estimating aggregations from GSW samples. We also show how to construct compact GSW samples when multiple measures are to be analyzed. We conduct experiments on real data to evaluate our solution and compare it with alternatives.
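The two-step pipeline described above can be illustrated with a small sketch. It is not FlashP and does not implement GSW sampling, whose definition is not given in this abstract. As a stand-in, it assumes plain uniform Poisson sampling with a Horvitz-Thompson estimator for the aggregation step and an ordinary least-squares linear trend for the forecasting step. The synthetic table, the `region` dimension, and all function names are hypothetical.

```python
import random


def make_rows(num_steps=30, rows_per_step=2000, seed=0):
    """Build a synthetic fact table of (time step, region, measure value)."""
    rng = random.Random(seed)
    rows = []
    for t in range(num_steps):
        for _ in range(rows_per_step):
            region = rng.choice(["north", "south", "east", "west"])
            value = rng.uniform(0.5, 1.5) * (10 + 0.2 * t)
            rows.append((t, region, value))
    return rows


def poisson_sample(rows, rate, seed=1):
    """Keep each row independently with probability `rate` (assumed scheme, not GSW)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < rate]


def estimate_sums(sample, rate, region, num_steps):
    """Step 1: slice on one region and estimate SUM(value) per time step.

    Each sampled value is scaled by 1 / rate (Horvitz-Thompson), so the
    estimate is unbiased; with rate=1.0 on the full table it is exact.
    """
    sums = [0.0] * num_steps
    for t, r, value in sample:
        if r == region:
            sums[t] += value / rate
    return sums


def fit_linear_trend(series):
    """Least-squares fit of series[t] ~ intercept + slope * t."""
    n = len(series)
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    return mean_y - slope * mean_t, slope


def forecast(series, horizon):
    """Step 2: train on the aggregated series and extrapolate `horizon` steps."""
    intercept, slope = fit_linear_trend(series)
    n = len(series)
    return [intercept + slope * (n + h) for h in range(horizon)]


if __name__ == "__main__":
    rows = make_rows()
    sample = poisson_sample(rows, rate=0.2)
    exact = estimate_sums(rows, 1.0, "north", 30)
    approx = estimate_sums(sample, 0.2, "north", 30)
    print("forecast from exact aggregates: ", forecast(exact, 3))
    print("forecast from sampled aggregates:", forecast(approx, 3))
```

Comparing the two printed forecasts gives a rough feel for the first question the abstract raises, namely how aggregation error propagates into the fitted model. A uniform sample is only the simplest baseline here; the abstract's point is that the choice of sampling scheme and sample size should follow from that error analysis.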