Feature extraction is an essential task in graph analytics. These feature vectors, called graph descriptors, are used in downstream vector-space-based graph analysis models. This idea has proved fruitful in the past, with spectral-based graph descriptors providing state-of-the-art classification accuracy. However, known algorithms to compute meaningful descriptors do not scale to large graphs since: (1) they require storing the entire graph in memory, and (2) the end-user has no control over the algorithm's runtime. In this paper, we present streaming algorithms to approximately compute three different graph descriptors capturing the essential structure of graphs. Operating on edge streams allows us to avoid storing the entire graph in memory, and controlling the sample size enables us to keep the runtime of our algorithms within desired bounds. We demonstrate the efficacy of the proposed descriptors by analyzing the approximation error and classification accuracy. Our scalable algorithms compute descriptors of graphs with millions of edges within minutes. Moreover, these descriptors yield predictive accuracy comparable to the state-of-the-art methods but can be computed using only 25% as much memory.
翻译:图形分析分析中, 特征提取是一项基本任务 。 这些特性矢量, 称为图形描述器, 用于下游矢量- 空基图形分析模型。 这个想法在过去已经证明是富有成果的, 光谱图形描述器提供了最新分类准确性。 然而, 计算有意义的描述器的已知算法不比大图表大, 因为:(1) 它们需要将整张图形存储在记忆中, (2) 终端用户无法控制算法的运行时间 。 在本文中, 我们提出流动算法, 以大致计算三个不同的图形描述器, 捕捉图形的基本结构 。 在边缘流操作, 使我们能够避免将整张图形存储在记忆中, 并控制样本大小, 使我们能够将我们算法的运行时间保持在理想范围内 。 我们通过分析近似错误和分类精确性来显示拟议描述器的功效 。 我们可缩放的算法在分钟内用数百万边缘来计算图形的描述器。 此外, 这些解算法产生的预测性准确性可以与状态- 艺术方法相比, 只能使用 25 的 。