We address the problem of computing the distribution of induced connected subgraphs, aka \emph{graphlets} or \emph{motifs}, in large graphs. The current state-of-the-art algorithms estimate the motif counts via uniform sampling, by leveraging the color coding technique by Alon, Yuster and Zwick. In this work we extend the applicability of this approach, by introducing a set of algorithmic optimizations and techniques that reduce the running time and space usage of color coding and improve the accuracy of the counts. To this end, we first show how to optimize color coding to efficiently build a compact table of a representative subsample of all graphlets in the input graph. For $8$-node motifs, we can build such a table in one hour for a graph with $65$M nodes and $1.8$B edges, which is $2000$ times larger than the state of the art. We then introduce a novel adaptive sampling scheme that breaks the "additive error barrier" of uniform sampling, guaranteeing multiplicative approximations instead of just additive ones. This allows us to count not only the most frequent motifs, but also extremely rare ones. For instance, on one graph we accurately count nearly $10.000$ distinct $8$-node motifs whose relative frequency is so small that uniform sampling would literally take centuries to find them. Our results show that color coding is still the most promising approach to scalable motif counting.
翻译:我们在大图表中处理计算导相连接子集、 aka \ emph{ graphletes} 或\ emph{ motifs} 的分布问题。 目前最先进的算法通过统一取样, 利用 Alon、 Yuster 和 Zwick 的颜色编码技术, 来估计 motif 值。 在这项工作中, 我们扩展了这个方法的应用范围, 引入了一套算法优化和技术, 减少了颜色编码的运行时间和空间使用, 并提高了计算准确性。 为此, 我们首先展示了如何优化颜色编码, 以高效地构建一个包含输入图中所有图表代表的子集的缩略表 。 对于 $8 $ node motif, 我们可以用一个小时来构建这样一个表, 以65 兆 节点和 1.8 美元 边框, 比艺术状态大20000 倍。 然后我们引入一个新的适应性抽样方法, 打破统一取样的“ 增加错误屏障 ”, 保证多度的codecodecodecodeal, 而不是仅以纯的 倍的频率计算。