分区分割式VC: SSD 的分区外部内存图分析框架 (PartitionedVC: Partitioned External Memory Graph Analytics Framework for SSDs)

Graphs analytics are at the heart of a broad range of applications such as drug discovery, page ranking, transportation systems, and recommendation systems. When graph size exceeds memory size, out-of-core graph processing is needed. For the widely used external memory graph processing systems, accessing storage becomes the bottleneck. We make the observation that nearly all graph algorithms have a dynamically varying number of active vertices that must be processed in each iteration. However, existing graph processing frameworks, such as GraphChi, load the entire graph in each iteration even if a small fraction of the graph is active. This limitation is due to the structure of the data storage used by these systems. In this work, we propose to use a compressed sparse row (CSR) based graph storage that is more amenable for selectively loading only a few active vertices in each iteration. But CSR based graph processing suffers from random update propagation to many target vertices. To solve this challenge we propose to use a multi-log update mechanism which logs updates separately, rather than directly update the active edge in a graph. Our proposed multi-log system maintains a separate log per each vertex interval. This separation enables us to efficiently process each vertex interval by just loading the corresponding log. Over the current state of the art out-of-core graph processing framework, our evaluation results show that the PartitionedVC framework improves performance by up to $16.40\times$, $1.13\times$, $1.64\times$, $1.38\times$, and $2.76\times$ for the widely used breadth-first search, pagerank, community detection, graph coloring, and the maximal independent set applications, respectively.

翻译：图表分析是药物发现、页面排名、运输系统和建议系统等广泛应用的核心。当图形大小超过内存大小时, 需要超出核心的图形处理。对于广泛使用的外部内存图处理系统, 访问存储会成为瓶颈。我们观察到几乎所有图表算法都有动态不同的主动顶点, 在每个迭代中必须处理。然而, 现有的图表处理框架, 如 GraphChi, 将整张图表加在每迭代中, 即使图表中有一小部分是活动的。这一限制是由于这些系统所使用的数据存储结构造成的。对于广泛使用的外部内存图处理系统, 需要使用一个压缩的稀薄的行( CSR) 基图存储系统, 它更便于有选择地在每次迭代代数中只装几个活跃的顶点。但是基于 CSR 的图形处理程序有随机更新传播到许多目标顶点。为了应对这一挑战, 我们提议使用一个多logal更新的图表, 而不是直接更新一个活跃的平面的平面的平面图。我们提议的多式平价框架, 将每个平面的平面平面的平面的平面的平面运行系统使用一个不同的平面记录, 显示每个平面的平面的平面的平面记录, 。