We propose a new method for estimating the number of answers OUT of a small join query Q in a large database D, and for uniform sampling over joins. Our method is the first to satisfy all the following statements. - Support arbitrary Q, which can be either acyclic or cyclic, and contain binary and non-binary relations. - Guarantee an arbitrary small error with a high probability always in \~O(AGM/OUT) time, where AGM is the AGM bound OUT (an upper bound of OUT), and \~O hides the polylogarithmic factor of input size. We also explain previous join size estimators in a unified framework. All methods including ours rely on certain indexes on relations in D, which take linear time to build offline. Additionally, we extend our method using generalized hypertree decompositions (GHDs) to achieve a lower complexity than \~O(AGM/OUT) when OUT is small, and present optimization techniques for improving estimation efficiency and accuracy.
翻译:保证在连接查询中进行均匀抽样和OUT大小估计的O(AGM/OUT)运行时间
翻译后的摘要:
我们提出了一种新的方法来估计大数据库D中小连接查询Q的答案数量OUT,并进行连接上的均匀抽样。我们的方法是第一个满足以下所有观点的。 - 支持任意Q,可以是无环或有环,包含二进制和非二进制关系。 - 保证任意小的误差始终在\~O(AGM/OUT)时间内高概率出现,其中AGM是OUT的AGM上界(OUT的一个上界),\~O隐藏输入大小的多对数因子。我们还在统一框架下解释了先前的连接大小估计器。所有方法包括我们的方法都依赖于D中的某些关系的索引,这需要离线建立线性时间。此外,我们使用广义超树分解(GHD)扩展我们的方法以在OUT较小时实现比\~O(AGM/OUT)更低的复杂性,并提供提高估计效率和准确性的优化技术。