The vast amounts of data used in social, business, or traffic networks, in biology, and in other natural sciences are often managed in graph-based data sets that range from a few thousand up to billions or even trillions of vertices and edges. Typical applications utilizing such data either execute one or a few complex queries, or run many small queries concurrently, either interactively or as batch jobs. Furthermore, graph processing is inherently complex: data sets can differ substantially (scale-free vs. constant degree), and algorithms exhibit diverse behavior (computational intensity, local or global scope, push- or pull-based traversal). This work is concerned with multi-query execution that automatically controls the degree of parallelism, with the overall objectives of high system utilization, low synchronization cost, and highly efficient concurrent execution. The underlying concept is three-fold: (1) sampling is used to determine graph statistics, (2) parallelization constraints are derived from algorithm and system properties, and (3) suitable work packages are generated based on the previous two aspects. We evaluate the proposed concept with different algorithms on synthetic and real-world data sets, with up to 16 concurrent sessions (queries). The results demonstrate robust performance across these varied configurations; in particular, performance is always close to, or even slightly ahead of, that of manually optimized implementations. Moreover, the comparable performance under extreme configurations, which require either full parallelization (few large queries) or fully sequential execution (many small queries), shows that the proposed concept incurs particularly low overhead.
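To make the three-fold concept concrete, the following is a minimal sketch of how its steps could fit together. All names (`sample_degrees`, `parallelism_bound`, `make_work_packages`, `WorkPackage`) and the specific heuristics are hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical sketch of the three-step concept; names and heuristics are
# illustrative assumptions, not taken from the paper.
import random
from dataclasses import dataclass


@dataclass
class WorkPackage:
    vertices: list   # vertex IDs processed as one unit
    workers: int     # degree of parallelism assigned to this package


def sample_degrees(adjacency, sample_size=100):
    """(1) Sampling: estimate the average vertex degree from a random subset."""
    sampled = random.sample(list(adjacency), min(sample_size, len(adjacency)))
    return sum(len(adjacency[v]) for v in sampled) / len(sampled)


def parallelism_bound(avg_degree, algorithm_is_global, hw_threads):
    """(2) Constraints: cap the worker count from algorithm and system properties."""
    # Assumption: global, communication-heavy algorithms tolerate wider
    # parallelism than local ones; the true constraints are algorithm-specific.
    cap = hw_threads if algorithm_is_global else max(1, hw_threads // 2)
    return max(1, min(cap, int(avg_degree) if avg_degree >= 1 else 1))


def make_work_packages(adjacency, workers):
    """(3) Generation: split the vertex set into one package per worker."""
    vertices = list(adjacency)
    chunk = max(1, len(vertices) // workers)
    return [WorkPackage(vertices[i:i + chunk], 1)
            for i in range(0, len(vertices), chunk)]


if __name__ == "__main__":
    graph = {v: list(range(v % 5 + 1)) for v in range(1000)}  # toy adjacency lists
    avg = sample_degrees(graph)
    w = parallelism_bound(avg, algorithm_is_global=True, hw_threads=16)
    packages = make_work_packages(graph, w)
    print(f"avg degree ~ {avg:.1f}, workers = {w}, packages = {len(packages)}")
```

Under this sketch, a session with few large queries would receive a high worker bound (full parallelization), while many small concurrent sessions would each be bounded near one worker, approximating sequential execution per query.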