Recently, Graph Neural Networks (GNNs) have been in the spotlight as a powerful tool that can effectively serve various inference tasks on graph-structured data. As the size of real-world graphs continues to grow, GNN training systems face a scalability challenge. Distributed training is a popular approach to addressing this challenge by scaling out across CPU nodes. However, little attention has been paid to disk-based GNN training, which can scale up a single-node system in a more cost-effective manner by leveraging high-performance storage devices such as NVMe SSDs. We observe that data movement between main memory and disk is the primary bottleneck in an SSD-based training system, and that the conventional GNN training pipeline is sub-optimal because it does not account for this overhead. We therefore propose Ginex, the first SSD-based GNN training system that can process billion-scale graph datasets on a single machine. Inspired by the inspector-executor execution model in compiler optimization, Ginex restructures the GNN training pipeline by separating the sample and gather stages. This separation enables Ginex to realize a provably optimal replacement algorithm, known as Belady's algorithm, for caching feature vectors in memory, which account for the dominant portion of I/O accesses. In our evaluation on four billion-scale graph datasets, Ginex achieves 2.11x higher training throughput on average (up to 2.67x at maximum) than the SSD-extended PyTorch Geometric.
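For readers unfamiliar with Belady's algorithm, the sketch below (our own illustration, not Ginex's implementation; the function name `belady_hits` and its parameters are hypothetical) shows why separating the sample stage from the gather stage matters: once sampling has produced the full sequence of feature-vector accesses ahead of time, the gather stage knows every future access, so it can always evict the cached entry whose next use is farthest in the future, which is exactly Belady's provably optimal (MIN) policy.

```python
# A minimal sketch, assuming node IDs are integers and the sample stage
# has already materialized the full access sequence for a batch window.
# This simulates Belady's (MIN) replacement and counts cache hits; it is
# an illustration of the policy, not Ginex's actual caching code.
import heapq

def belady_hits(access_seq, cache_size):
    """Return the number of cache hits under Belady's optimal replacement.

    access_seq: node IDs in the order the gather stage will read their
                feature vectors (known in advance because the sample
                stage runs separately, before any gathering).
    cache_size: number of feature vectors the in-memory cache can hold.
    """
    # Precompute, for each position i, the next position at which the
    # same node is accessed again (inf if it is never accessed again).
    next_use = [float('inf')] * len(access_seq)
    last_pos = {}
    for i in range(len(access_seq) - 1, -1, -1):
        node = access_seq[i]
        next_use[i] = last_pos.get(node, float('inf'))
        last_pos[node] = i

    cache = {}   # node ID -> position of its next access
    heap = []    # max-heap via negation: (-next_pos, node)
    hits = 0
    for i, node in enumerate(access_seq):
        if node in cache:
            hits += 1
        elif len(cache) >= cache_size:
            # Evict the resident node reused farthest in the future,
            # skipping stale heap entries (lazy deletion).
            while True:
                neg_pos, victim = heapq.heappop(heap)
                if cache.get(victim) == -neg_pos:
                    del cache[victim]
                    break
        cache[node] = next_use[i]
        heapq.heappush(heap, (-next_use[i], node))
    return hits

# Example: with a cache of 2 entries, the repeated access to node 1
# is a hit because the policy keeps the entry with the nearest reuse.
print(belady_hits([1, 2, 1, 3], cache_size=2))  # -> 1
```

A real system would cache the feature tensors themselves and overlap eviction decisions with disk I/O; the point of the sketch is only that choosing the farthest-future-use victim requires the access order to be fixed in advance, which the separated sample stage provides.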