Feature extraction is an essential task in graph analytics. The resulting feature vectors, called graph descriptors, are used in downstream vector-space-based graph analysis models. This idea has proved fruitful in the past, with spectral-based graph descriptors providing state-of-the-art classification accuracy. However, known algorithms for computing meaningful descriptors do not scale to large graphs for two reasons: (1) they require storing the entire graph in memory, and (2) they give the end user no control over the algorithm's runtime. In this paper, we present streaming algorithms that approximately compute three different graph descriptors capturing the essential structure of graphs. Operating on edge streams allows us to avoid storing the entire graph in memory, and controlling the sample size enables us to keep the runtime of our algorithms within desired bounds. We demonstrate the efficacy of the proposed descriptors by analyzing their approximation error and classification accuracy. Our scalable algorithms compute descriptors of graphs with millions of edges within minutes. Moreover, these descriptors yield predictive accuracy comparable to state-of-the-art methods while requiring only 25% as much memory.
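To make the edge-stream setting concrete, the following minimal sketch shows one standard way to hold a fixed-size sample of edges while reading a stream once. It uses plain reservoir sampling as a stand-in; the abstract does not specify the sampling scheme the proposed algorithms use, so this should not be read as the paper's method. The names `reservoir_sample_edges`, `cycle_edges`, and `budget` are hypothetical and chosen only for illustration.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Edge = Tuple[int, int]


def reservoir_sample_edges(stream: Iterable[Edge], budget: int, seed: int = 0) -> List[Edge]:
    """Keep a uniform random sample of at most `budget` edges from a one-pass stream.

    Assumption: standard reservoir sampling, used here only as an illustration.
    Memory use depends on `budget`, not on the total number of edges.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    rng = random.Random(seed)
    reservoir: List[Edge] = []
    for seen, edge in enumerate(stream, start=1):
        if len(reservoir) < budget:
            # Fill phase: the first `budget` edges are always stored.
            reservoir.append(edge)
        else:
            # The new edge replaces a stored one with probability budget / seen.
            slot = rng.randrange(seen)
            if slot < budget:
                reservoir[slot] = edge
    return reservoir


def cycle_edges(n: int) -> Iterator[Edge]:
    """Hypothetical toy stream: the edges of an n-node cycle, generated lazily."""
    for u in range(n):
        yield (u, (u + 1) % n)


# The full edge list is never materialized; only the sample is kept.
sample = reservoir_sample_edges(cycle_edges(100_000), budget=1_000)
```

In a sketch like this, the `budget` parameter is the knob that trades accuracy for memory and runtime: any descriptor estimate would be computed from the stored sample alone, so its cost is bounded by the sample size rather than by the size of the graph.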