Large-scale graphs with billions of edges are ubiquitous across industry, science, and engineering, in domains such as recommendation systems, social graph analysis, knowledge bases, materials science, and biology. Graph neural networks (GNNs), an emerging class of machine learning models, are increasingly adopted to learn on these graphs because of their superior performance on a variety of graph analytics tasks. Mini-batch training is the common approach to training on large graphs, and data parallelism is the standard way to scale mini-batch training to multiple GPUs. In this paper, we argue that several fundamental performance bottlenecks of GNN training systems stem from inherent limitations of the data parallel approach. We then propose split parallelism, a novel parallel mini-batch training paradigm. We implement split parallelism in a novel system called gsplit and show that it outperforms state-of-the-art systems such as DGL, Quiver, and PaGraph.
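For concreteness, below is a minimal single-process sketch of the mini-batch training loop the abstract refers to, written against DGL's public sampling API (`dgl.dataloading.NeighborSampler` and `dgl.dataloading.DataLoader`, as in DGL 0.8+). The toy graph, feature dimensions, and hyperparameters are made up for illustration. Under data parallelism, each GPU would run a replica of this loop on a disjoint shard of the seed nodes and synchronize gradients after each step.

```python
# Minimal sketch of mini-batch GNN training with neighborhood sampling.
# The graph, features, labels, and hyperparameters are synthetic placeholders.
import dgl
import torch
import torch.nn as nn

# Toy graph with random features and labels.
g = dgl.rand_graph(10_000, 200_000)
g.ndata["feat"] = torch.randn(g.num_nodes(), 64)
labels = torch.randint(0, 10, (g.num_nodes(),))
train_nids = torch.arange(g.num_nodes())

class SAGE(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = dgl.nn.SAGEConv(64, 64, "mean")
        self.conv2 = dgl.nn.SAGEConv(64, 10, "mean")

    def forward(self, blocks, x):
        h = torch.relu(self.conv1(blocks[0], x))
        return self.conv2(blocks[1], h)

model = SAGE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Sample a 2-hop neighborhood (10 neighbors per hop) for each mini-batch
# of seed nodes, then train only on the resulting sampled subgraph.
sampler = dgl.dataloading.NeighborSampler([10, 10])
loader = dgl.dataloading.DataLoader(
    g, train_nids, sampler, batch_size=1024, shuffle=True)

for input_nodes, output_nodes, blocks in loader:
    x = blocks[0].srcdata["feat"]   # features of the sampled input nodes
    y = labels[output_nodes]        # labels of the seed nodes
    loss = nn.functional.cross_entropy(model(blocks, x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the data parallel setting this abstract critiques, each GPU independently samples its own mini-batches and gathers the corresponding features, which is where the redundant sampling and feature-loading work across GPUs arises.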