Interest is growing in developing and deploying deep graph learning algorithms for graph analysis tasks such as node and edge classification, link prediction, and clustering, with numerous practical applications such as fraud detection, drug discovery, and recommender systems. However, only a limited number of graph-structured datasets are publicly available, and most of them are tiny compared to production-sized applications with trillions of edges and billions of nodes. Further, new algorithms and models are benchmarked on similar datasets with similar properties. In this work, we tackle this shortcoming by proposing a scalable synthetic graph generation tool that can mimic the original data distribution of real-world graphs and scale them to arbitrary sizes. The tool can then be used to learn a set of parametric models from proprietary datasets, which can subsequently be released to researchers to study various graph methods on the synthetic data, accelerating prototype development and enabling novel applications. Finally, the performance of graph learning algorithms depends not only on the size but also on the structure of the dataset. We show how our framework generalizes across a set of datasets, mimicking both structural and feature distributions, and we demonstrate its scalability across varying dataset sizes.
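The workflow described in the abstract (fit a small parametric model to a graph, then sample a larger graph from it) can be illustrated with a minimal sketch. The abstract does not name the model family, so the sketch assumes an R-MAT-style recursive quadrant generator as a stand-in. The function names `fit_rmat` and `generate_rmat`, the crude counting estimator, and the toy edge list are all hypothetical and are not taken from the paper; feature generation is not covered.

```python
import random
from collections import Counter


def fit_rmat(edges, num_bits):
    """Crudely estimate R-MAT quadrant probabilities from an edge list.

    Assumption: node IDs fit in `num_bits` bits. At every bit level, each
    edge falls into one of four quadrants given by (src bit, dst bit);
    the estimate is the quadrant frequency pooled over all levels.
    Requires a non-empty edge list.
    """
    counts = Counter()
    for src, dst in edges:
        for level in range(num_bits):
            quadrant = (((src >> level) & 1) << 1) | ((dst >> level) & 1)
            counts[quadrant] += 1
    total = sum(counts.values())
    return [counts[q] / total for q in range(4)]


def generate_rmat(probs, num_bits, num_edges, seed=0):
    """Sample `num_edges` directed edges over 2**num_bits nodes.

    Each edge is built one bit at a time by drawing a quadrant from
    `probs`. Duplicate edges and self-loops are not filtered out.
    """
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        src = dst = 0
        for _ in range(num_bits):
            quadrant = rng.choices(range(4), weights=probs)[0]
            src = (src << 1) | (quadrant >> 1)
            dst = (dst << 1) | (quadrant & 1)
        edges.append((src, dst))
    return edges


# Toy "proprietary" graph on 4 nodes (2-bit IDs); illustrative data only.
original = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (3, 0)]
probs = fit_rmat(original, num_bits=2)

# Scale up: 4x the node count (4-bit IDs) and 4x the edge count.
synthetic = generate_rmat(probs, num_bits=4, num_edges=4 * len(original))
```

In this sketch only the four fitted probabilities would need to be shared, not the original edge list, which mirrors the idea of releasing learned parametric models in place of proprietary data.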