Improving the training and inference performance of graph neural networks (GNNs) faces a challenge uncommon in general neural networks: creating mini-batches requires substantial computation and data movement, because multi-hop graph neighborhoods grow exponentially with network depth. This unique challenge gives rise to a diverse set of system design choices. We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment, under which we identify major performance bottlenecks hitherto under-explored by developers: mini-batch preparation and transfer. We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler, a shared-memory parallelization strategy, and the pipelining of batch transfer with GPU computation. We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised. This observation unifies training and inference, simplifying model implementation. We report comprehensive experimental results with several benchmark data sets and GNN architectures, including a demonstration that, on the ogbn-papers100M data set, our system SALIENT achieves a 3x speedup over a standard PyTorch Geometric implementation with a single GPU and a further 8x parallel speedup with 16 GPUs. There, training a 3-layer GraphSAGE model with sampling fanout (15, 10, 5) takes 2.0 seconds per epoch and inference with fanout (20, 20, 20) takes 2.4 seconds, attaining 64.58% test accuracy.
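To make the two core techniques named above concrete, the following is a minimal sketch, not SALIENT's implementation: it shows neighborhood-sampled mini-batch training of a 3-layer GraphSAGE model with fanout (15, 10, 5), plus a simple one-batch-ahead prefetcher that overlaps host-to-device batch transfer with GPU computation on a side CUDA stream. It uses off-the-shelf PyTorch Geometric (NeighborLoader, GraphSAGE); the Reddit dataset, batch size, learning rate, and hidden width are illustrative assumptions, while the layer count and fanout come from the abstract.

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Reddit      # stand-in dataset (assumption)
from torch_geometric.loader import NeighborLoader
from torch_geometric.nn import GraphSAGE

device = torch.device('cuda')
data = Reddit(root='/tmp/Reddit')[0]

# Fanout (15, 10, 5): sample at most 15/10/5 neighbors per node at hops 1/2/3,
# limiting the exponential growth of multi-hop neighborhoods.
loader = NeighborLoader(
    data,
    num_neighbors=[15, 10, 5],
    batch_size=1024,
    input_nodes=data.train_mask,
    shuffle=True,
    num_workers=4,       # parallel CPU sampling workers
    pin_memory=True,     # page-locked buffers enable asynchronous H2D copies
)

model = GraphSAGE(data.num_features, 256, num_layers=3, out_channels=41).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.003)

copy_stream = torch.cuda.Stream()

def to_gpu(batch):
    # Issue the transfer on a side stream so it can overlap with compute
    # running on the default stream.
    with torch.cuda.stream(copy_stream):
        return batch.to(device, non_blocking=True)

it = iter(loader)
next_batch = to_gpu(next(it))
for _ in range(len(loader)):
    torch.cuda.current_stream().wait_stream(copy_stream)  # transfer finished
    batch = next_batch
    try:
        next_batch = to_gpu(next(it))  # prefetch while training on `batch`
    except StopIteration:
        next_batch = None
    optimizer.zero_grad()
    # Only the first `batch_size` nodes are seed nodes with labels to predict;
    # the rest are sampled neighbors needed for message passing.
    out = model(batch.x, batch.edge_index)[:batch.batch_size]
    loss = F.cross_entropy(out, batch.y[:batch.batch_size])
    loss.backward()
    optimizer.step()
```

Per the abstract, SALIENT itself goes further than this sketch: it replaces the stock sampler with a performance-engineered one, parallelizes batch preparation with a shared-memory strategy, and pipelines transfer with computation more deeply; the code above only illustrates the sampling and overlap principles.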