Range aggregate queries (RAQs) are an integral part of many real-world applications, where fast, approximate answers to the queries are often desired. Recent work has studied answering RAQs using machine learning (ML) models, where a model of the data is learned to answer the queries. However, there is no theoretical understanding of why and when the ML-based approaches perform well. Furthermore, since the ML approaches model the data, they fail to capitalize on any query-specific information to improve performance in practice. In this paper, we focus on modeling ``queries'' rather than data, and train neural networks to learn the query answers. This change of focus allows us to theoretically study our ML approach and to provide a distribution- and query-dependent error bound for neural networks when answering RAQs. We confirm our theoretical results by developing NeuroSketch, a neural network framework that answers RAQs in practice. An extensive experimental study on real-world, TPC-benchmark and synthetic datasets shows that NeuroSketch answers RAQs multiple orders of magnitude faster than the state of the art, and with better accuracy.
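The core idea of modeling ``queries'' rather than data can be illustrated with a minimal sketch: a small neural network is trained on (query, answer) pairs so that, at query time, only the network is evaluated and the data itself is never scanned. The network architecture, dataset, aggregate function, and training setup below are illustrative assumptions, not the paper's actual NeuroSketch design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D dataset; at answer time the "query model" never touches it.
data = rng.normal(0.5, 0.15, size=10_000)

def true_answer(lo, hi):
    # Normalized COUNT aggregate over the range [lo, hi] (assumed aggregate).
    return np.sum((data >= lo) & (data <= hi)) / len(data)

# Training set: random range queries labeled with their exact answers.
lows = rng.uniform(0.0, 1.0, size=2_000)
highs = lows + rng.uniform(0.0, 0.5, size=2_000)
X = np.stack([lows, highs], axis=1)
y = np.array([true_answer(lo, hi) for lo, hi in X])

# One-hidden-layer ReLU MLP trained with full-batch gradient descent on MSE.
W1 = rng.normal(0.0, 0.1, (2, 32)); b1 = np.zeros(32)
W2 = rng.normal(0.0, 0.1, (32, 1)); b2 = np.zeros(1)

def forward(Q):
    h = np.maximum(0.0, Q @ W1 + b1)          # hidden ReLU layer
    return h, (h @ W2 + b2).ravel()           # scalar answer per query

lr = 0.05
initial_loss = final_loss = None
for step in range(3_000):
    h, pred = forward(X)
    err = pred - y                            # gradient of 0.5 * MSE
    final_loss = float(np.mean(err ** 2))
    if initial_loss is None:
        initial_loss = final_loss
    gW2 = h.T @ err[:, None] / len(X)
    gb2 = err.mean(keepdims=True)
    dh = err[:, None] @ W2.T
    dh[h <= 0.0] = 0.0                        # ReLU gradient mask
    gW1 = X.T @ dh / len(X)
    gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# Query time: a single forward pass, independent of the data size.
_, approx = forward(np.array([[0.4, 0.6]]))
exact = true_answer(0.4, 0.6)
```

The query-time cost here depends only on the network size, which is what makes such query models fast; the paper's theoretical contribution, by contrast, is a distribution- and query-dependent bound on the approximation error, which this toy example does not attempt to reproduce.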