With the increasing adoption of graph neural networks (GNNs) in the machine learning community, GPUs have become an essential tool for accelerating GNN training. However, training GNNs on very large graphs that do not fit in GPU memory remains a challenging task. Unlike conventional neural networks, mini-batching input samples in GNNs requires complex operations such as traversing neighboring nodes and gathering their feature values. While this process accounts for a significant portion of the training time, we find that existing GNN implementations built on popular deep neural network (DNN) libraries such as PyTorch are limited to a CPU-centric approach for the entire data preparation step. This "all-in-CPU" approach has a negative impact on overall GNN training performance, as it over-utilizes CPU resources and hinders GPU acceleration of GNN training. To overcome these limitations, we introduce PyTorch-Direct, which enables a GPU-centric data accessing paradigm for GNN training. In PyTorch-Direct, GPUs can efficiently access complicated data structures in host memory directly, without CPU intervention. Our microbenchmark and end-to-end GNN training results show that PyTorch-Direct reduces data transfer time by 47.1% on average and speeds up GNN training by up to 1.6x. Furthermore, by reducing CPU utilization, PyTorch-Direct also reduces system power consumption by 12.4% to 17.5% during training. To minimize programmer effort, we introduce a new "unified tensor" type along with the necessary changes to the PyTorch memory allocator, dispatch logic, and placement rules. As a result, users need to change at most two lines of their PyTorch GNN training code per tensor object to take advantage of PyTorch-Direct.
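The data-preparation step the abstract describes, traversing neighbors and gathering their features for each mini-batch, can be sketched in plain Python. This is an illustrative toy, not the paper's implementation: the graph, feature table, and fanout-based sampling policy below are assumptions chosen only to show where the scattered CPU-side reads occur.

```python
import random

# Toy graph as an adjacency list, plus a per-node feature table in host memory.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
features = {n: [float(n)] * 4 for n in graph}  # 4-dimensional features

def sample_minibatch(seed_nodes, fanout, rng):
    """Traverse neighbors of the seed nodes and gather their features.

    In the all-in-CPU approach, this traversal and the feature gather
    (many small, scattered reads from host memory) all run on the CPU
    before anything is copied to the GPU.
    """
    batch_nodes = set(seed_nodes)
    for node in seed_nodes:
        neighbors = graph[node]
        batch_nodes.update(rng.sample(neighbors, min(fanout, len(neighbors))))
    # Feature gathering: the dominant cost for large feature tables.
    return {n: features[n] for n in sorted(batch_nodes)}

rng = random.Random(0)
batch = sample_minibatch([0, 3], fanout=1, rng=rng)
```

In PyTorch-Direct's GPU-centric paradigm, the gathered features would instead reside in a unified tensor in host memory, and the GPU would read the scattered elements directly rather than waiting for the CPU to assemble and copy a contiguous batch.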