A major bottleneck of the current machine learning (ML) workflow is the time-consuming, error-prone engineering required to get data from a datastore or database (DB) to the point where an ML algorithm can be applied to it. Hence, we explore the feasibility of directly integrating prediction functionality on top of a datastore or DB. Such a system ideally: (i) provides an intuitive prediction query interface that alleviates the unwieldy data engineering; and (ii) provides state-of-the-art statistical accuracy while ensuring incremental model updates, low model training time, and low latency for making predictions. As our main contribution, we explicitly instantiate a proof of concept, tspDB, which integrates directly with PostgreSQL. We rigorously test tspDB's statistical and computational performance against state-of-the-art time series algorithms, including a Long Short-Term Memory (LSTM) neural network and DeepAR (an industry-standard deep learning library by Amazon). Statistically, on standard time series benchmarks, tspDB outperforms LSTM and DeepAR with 1.1-1.3x higher relative accuracy. Computationally, tspDB is 59-62x and 94-95x faster compared to LSTM and DeepAR in terms of median ML model training time and prediction query latency, respectively. Further, compared to PostgreSQL's bulk insert time and its SELECT query latency, tspDB is slower by only 1.3x and 2.6x, respectively. That is, tspDB is a real-time prediction system in the sense that its model training and prediction query times are similar to those of simply inserting data into and reading data from a DB. As an algorithmic contribution, we introduce an incremental, multivariate, matrix-factorization-based time series method, on which tspDB is built. We show that this method also produces reliable prediction intervals by accurately estimating the time-varying variance of a time series, thereby addressing an important problem in time series analysis.