高效地计算连接查询的差数</s> (Computing the Difference of Conjunctive Queries Efficiently)

We investigate how to efficiently compute the difference result of two (or multiple) conjunctive queries, which is the last operator in relational algebra to be unraveled. The standard approach in practical database systems is to materialize the results for every input query as a separate set, and then compute the difference of two (or multiple) sets. This approach is bottlenecked by the complexity of evaluating every input query individually, which could be very expensive, particularly when there are only a few results in the difference. In this paper, we introduce a new approach by exploiting the structural property of input queries and rewriting the original query by pushing the difference operator down as much as possible. We show that for a large class of difference queries, this approach can lead to a linear-time algorithm, in terms of the input size and (final) output size, i.e., the number of query results that survive from the difference operator. We complete this result by showing the hardness of computing the remaining difference queries in linear time. Although a linear-time algorithm is hard to achieve in general, we also provide some heuristics that can provably improve the standard approach. At last, we compare our approach with standard SQL engines over graph and benchmark datasets. The experiment results demonstrate order-of-magnitude speedups achieved by our approach over the vanilla SQL.

翻译：我们调查如何有效地计算两个(或多个)相交查询的不同结果, 即两个( 或多个) 相交查询的差别结果, 这是将拆解的关系代数中最后一个运算符。实用数据库系统中的标准方法是将每个输入查询的结果作为单独一组来落实, 然后将两个( 或多个) 组的差异计算出来。这种方法受到对每个输入查询的复杂程度的制约, 每一个输入查询的复杂程度可能非常昂贵, 特别是当差异的结果只有很少的时候。在本文中, 我们引入了一种新的方法, 利用输入查询的结构属性, 并尽可能将差异运算员推倒重写原始查询。我们显示, 对于一个大类别的差异查询, 这个方法可以导致一个线性时间算法, 从输入大小和( 最终) 输出大小来看, 即从差异运算者所幸存的查询结果的数量。我们通过显示在线性时间里查询的难度来完成这一结果。尽管直线性时间算法很难实现, 我们也可以提供一些超标的原始查询方法, 来改进标准的S- L 标准速度方法。我们最后通过S- L 比较了我们的S- 标准的S- 速度方法。</s>