[Technical Report] Combining Sampling and Synopses with Worst-Case Optimal Runtime and Quality Guarantees for Graph Pattern Cardinality Estimation

Graph pattern cardinality estimation is the problem of estimating the number of embeddings of a query graph in a data graph. This fundamental problem arises, for example, during query planning in subgraph matching algorithms. There are two major approaches to solving it: sampling and synopses. Synopsis-based (or summary-based) methods are fast and accurate when the synopses capture the graph's information well. However, they suffer from large errors caused by the loss of information during summarization and by their inherent assumptions. Sampling-based methods are unbiased but suffer from large estimation variance because the sample space is large. To address these limitations, we propose Alley, a hybrid method that combines sampling and synopses. Alley employs 1) a novel sampling strategy, random walk with intersection, which effectively reduces the sample space, 2) branching to further reduce variance, and 3) a novel mining approach that extracts and indexes tangled patterns, which are inherently difficult to estimate by sampling, as synopses. Using these synopses in the online estimation phase effectively reduces the sample space while still ensuring unbiasedness. We establish that Alley has worst-case optimal runtime and approximation quality guarantees for any given error bound $\epsilon$ and required confidence $\mu$. Beyond these theoretical guarantees, our extensive experiments show that Alley is up to orders of magnitude more accurate than state-of-the-art methods at similar efficiency.
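To make the idea of unbiased sampling over a reduced sample space concrete, the following is a minimal Python sketch, not Alley's actual implementation. It assumes an undirected, unlabeled data graph stored as a dictionary of neighbor sets, counts homomorphic matches (repeated data vertices are allowed), and matches query vertices in index order. The names `sample_once`, `estimate`, and `exact_count` are hypothetical. Each walk draws the next data vertex uniformly from the intersection of the neighbor sets of the already-matched adjacent vertices and multiplies the sizes of these candidate sets; the expectation of that product is the true count. Branching and synopses are not modeled here.

```python
import random
from itertools import product


def exact_count(query_edges, num_query_vertices, adj):
    """Brute-force count of homomorphic matches (used only as a reference)."""
    vertices = list(adj)
    count = 0
    for mapping in product(vertices, repeat=num_query_vertices):
        if all(mapping[b] in adj[mapping[a]] for a, b in query_edges):
            count += 1
    return count


def sample_once(query_edges, num_query_vertices, adj, rng):
    """One walk: returns the product of candidate-set sizes, or 0 on failure."""
    mapping = []  # mapping[u] is the data vertex chosen for query vertex u
    weight = 1
    for u in range(num_query_vertices):
        # Neighbor sets of data vertices already matched to query neighbors of u.
        constraints = [adj[mapping[a]] for a, b in query_edges if b == u and a < u]
        constraints += [adj[mapping[b]] for a, b in query_edges if a == u and b < u]
        if constraints:
            # Intersecting all constraints at once shrinks the sample space.
            candidates = set.intersection(*constraints)
        else:
            candidates = set(adj)  # unconstrained: any data vertex is a candidate
        if not candidates:
            return 0  # the walk failed, so it contributes zero
        weight *= len(candidates)  # inverse of the uniform sampling probability
        mapping.append(rng.choice(sorted(candidates)))
    return weight


def estimate(query_edges, num_query_vertices, adj, num_samples, rng):
    """Average the per-walk weights; the mean is an unbiased estimate."""
    total = sum(
        sample_once(query_edges, num_query_vertices, adj, rng)
        for _ in range(num_samples)
    )
    return total / num_samples


if __name__ == "__main__":
    # Data graph: complete graph on 4 vertices. Query: a triangle.
    k4 = {v: {w for w in range(4) if w != v} for v in range(4)}
    triangle = [(0, 1), (1, 2), (0, 2)]
    rng = random.Random(0)
    print(exact_count(triangle, 3, k4), estimate(triangle, 3, k4, 100, rng))
```

On a complete graph every walk sees the same candidate-set sizes, so the estimate has zero variance there. On skewed real graphs the per-walk weights vary widely, which is the variance problem that the abstract's branching and synopses are meant to reduce.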
