We investigate how to efficiently compute the difference result of two (or multiple) conjunctive queries, which is the last operator in relational algebra to be unraveled. The standard approach in practical database systems is to materialize the results for every input query as a separate set, and then compute the difference of two (or multiple) sets. This approach is bottlenecked by the complexity of evaluating every input query individually, which could be very expensive, particularly when there are only a few results in the difference. In this paper, we introduce a new approach by exploiting the structural property of input queries and rewriting the original query by pushing the difference operator down as much as possible. We show that for a large class of difference queries, this approach can lead to a linear-time algorithm, in terms of the input size and (final) output size, i.e., the number of query results that survive from the difference operator. We complete this result by showing the hardness of computing the remaining difference queries in linear time. Although a linear-time algorithm is hard to achieve in general, we also provide some heuristics that can provably improve the standard approach. At last, we compare our approach with standard SQL engines over graph and benchmark datasets. The experiment results demonstrate order-of-magnitude speedups achieved by our approach over the vanilla SQL.
翻译:我们研究如何高效地计算两个(或多个)连通查询的差异结果,这是关系代数中最后一个被揭示的运算符。实际数据库系统中的标准方法是将每个输入查询的结果分别材料化为单独的集合,然后计算两个(或多个)集合的差异。这种方法受限于单独评估每个输入查询的复杂性,特别是当差异中仅有少量结果时,代价可能非常高。在本文中,我们通过利用输入查询的结构特性,并通过尽可能地将差异运算符推向下方重写原始查询来介绍一种新的方法。我们展示了对于一类大型差异查询,这种方法可以导致一个线性时间算法,以输入大小和(最终)输出大小为基础,即从差异运算符中幸存下来的查询结果数量。我们通过展示在线性时间内计算剩余差异查询的难度来完成这个结果。虽然一般情况下很难实现线性时间算法,但我们还是提供了一些启发式方法,可以有效改进标准方法。最后,我们将我们的方法与图形和基准数据集上的标准 SQL 引擎进行了比较。实验结果证明,我们的方法能够实现数量级的加速比。