Synthetic graph generators facilitate research in graph algorithms and processing systems by providing access to data, for instance, graphs resembling social networks, while circumventing privacy and security concerns. Nevertheless, their practical value lies in their ability to capture important metrics of real graphs, such as degree distribution and clustering properties. Graph generators must also be able to produce such graphs at the scale of real-world industry graphs, that is, hundreds of billions or trillions of edges. In this paper, we propose Darwini, a graph generator that captures a number of core characteristics of real graphs. Importantly, given a source graph, it can reproduce the degree distribution and, unlike existing approaches, the local clustering coefficient and joint-degree distributions. Furthermore, Darwini maintains metrics such node PageRank, eigenvalues and the K-core decomposition of a source graph. Comparing Darwini with state-of-the-art generative models, we show that it can reproduce these characteristics more accurately. Finally, we provide an open source implementation of our approach on the vertex-centric Apache Giraph model that allows us to create synthetic graphs with one trillion edges.
翻译:合成图形生成器通过提供数据访问便利了图表算法和处理系统的研究,例如,提供类似社交网络的图表,同时绕过隐私和安全关切。然而,它们的实际价值在于它们能够捕捉真实图形的重要度量,如度分布和组合属性。图形生成器还必须能够以真实世界工业图的规模,即数百亿或数万亿边缘规模来生成此类图表。在本文中,我们提议达尔文,这是一个图形生成器,可以捕捉真实图表的一些核心特征。重要的是,根据源图,它可以复制度分布,并且与现有方法不同,本地组群系数系数和联合度分布。此外,达尔文还维持了诸如PageRank、egen值和源图的K核心分解配置等指标。将达尔文与最先进的基因化模型进行比较,我们表明它能够更准确地复制这些特征。最后,我们提供了一种开放的来源,在垂直的阿基拉法模型上,我们可以创建一面的合成图。