Range aggregate queries (RAQs) are an integral part of many real-world applications, where fast, approximate answers are often desired. Recent work has studied answering RAQs using machine learning (ML) models, where a model of the data is learned to answer the queries. However, there is no theoretical understanding of why and when the ML-based approaches perform well. Furthermore, since the ML approaches model the data, they fail to capitalize on any query-specific information to improve performance in practice. In this paper, we focus on modeling "queries" rather than data and train neural networks to learn the query answers. This change of focus allows us to theoretically study our ML approach and to provide a distribution- and query-dependent error bound for neural networks when answering RAQs. We confirm our theoretical results by developing NeuroSketch, a neural network framework to answer RAQs in practice. An extensive experimental study on real-world, TPC-benchmark, and synthetic datasets shows that NeuroSketch answers RAQs multiple orders of magnitude faster than the state of the art, and with better accuracy.
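The idea of modeling queries rather than data can be illustrated with a small, self-contained sketch. This is not NeuroSketch itself: the dataset, the query distribution, the network size, the learning rate, and every name below (`exact_answer`, `sample_query`, `predict`, `train`) are assumptions chosen for illustration. A query is a one-dimensional range `(lo, hi)`, its label is the exact normalized COUNT, and a tiny one-hidden-layer network is fitted to map queries directly to answers, using only the Python standard library.

```python
import bisect
import math
import random

random.seed(0)

# Hypothetical 1-D dataset: 5,000 values in [0, 1], kept sorted for exact counting.
data = sorted(random.betavariate(2, 5) for _ in range(5000))

def exact_answer(lo, hi):
    """Exact normalized COUNT: fraction of data points x with lo <= x <= hi."""
    return (bisect.bisect_right(data, hi) - bisect.bisect_left(data, lo)) / len(data)

def sample_query():
    """Draw a range query (lo, hi) with 0 <= lo <= hi <= 1 (assumed query distribution)."""
    a, b = random.random(), random.random()
    return (min(a, b), max(a, b))

# Training set: the queries are the inputs, the exact answers are the labels.
train_queries = [sample_query() for _ in range(300)]
train_answers = [exact_answer(lo, hi) for lo, hi in train_queries]

HIDDEN = 8
# One hidden tanh layer; these weights are all that is needed at query time.
w1 = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [random.gauss(0, 1) / math.sqrt(HIDDEN) for _ in range(HIDDEN)]
b2 = 0.0

def predict(lo, hi):
    """Forward pass: approximate the query answer without touching the data."""
    hidden = [math.tanh(w[0] * lo + w[1] * hi + b) for w, b in zip(w1, b1)]
    return sum(v * h for v, h in zip(w2, hidden)) + b2

def mean_squared_error():
    """Mean squared error of the network over the training queries."""
    total = sum((predict(lo, hi) - y) ** 2
                for (lo, hi), y in zip(train_queries, train_answers))
    return total / len(train_queries)

def train(epochs=200, lr=0.05):
    """Plain stochastic gradient descent on the per-query squared error."""
    global b2
    for _ in range(epochs):
        for (lo, hi), y in zip(train_queries, train_answers):
            hidden = [math.tanh(w[0] * lo + w[1] * hi + b) for w, b in zip(w1, b1)]
            err = sum(v * h for v, h in zip(w2, hidden)) + b2 - y
            for j in range(HIDDEN):
                # Backpropagate through tanh before w2[j] is updated.
                grad_pre = err * w2[j] * (1.0 - hidden[j] ** 2)
                w2[j] -= lr * err * hidden[j]
                w1[j][0] -= lr * grad_pre * lo
                w1[j][1] -= lr * grad_pre * hi
                b1[j] -= lr * grad_pre
            b2 -= lr * err

loss_before = mean_squared_error()
train()
loss_after = mean_squared_error()
```

After training, `predict(0.1, 0.4)` can be compared against `exact_answer(0.1, 0.4)` to see the approximation error for a single query. In this toy setting the exact answer is already cheap to compute; the sketch only shows the shape of the approach, where answering a query costs one forward pass regardless of the data size.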