A key performance bottleneck when training graph neural network (GNN) models on large, real-world graphs is loading node features onto the GPU. Because GPU memory is limited, these features must reside on devices with slower access (e.g., CPU memory), and moving them to the GPU incurs expensive data transfers. Moreover, the irregularity of graph structures leads to poor data locality, which further exacerbates the problem. Consequently, existing frameworks that can efficiently train large GNN models usually suffer a significant accuracy degradation because of the shortcuts they inevitably take. To address these limitations, we instead propose ReFresh, a general-purpose GNN mini-batch training framework that uses a historical cache to store and reuse GNN node embeddings rather than re-computing them by fetching raw features at every iteration. Critical to its success, the cache policy combines gradient-based and staleness criteria to separate embeddings that are relatively stable, and can therefore be cached, from those that must be re-computed to limit estimation errors and the resulting downstream accuracy loss. Paired with complementary system enhancements that support this selective historical cache, ReFresh accelerates training on large graph datasets such as ogbn-papers100M and MAG240M by 4.6x to 23.6x and reduces memory access by 64.5% (85.7% more than a raw feature cache), with less than 1% impact on test accuracy.
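To make the selective caching idea concrete, the following is a minimal, hypothetical sketch in PyTorch-style Python. The class name, thresholds, and method signatures are our own illustration rather than ReFresh's actual API; it only shows the general pattern of reusing an embedding when it is both recent (staleness criterion) and stable (gradient-based criterion), and re-computing it otherwise.

```python
import torch

class HistoricalEmbeddingCache:
    """Hypothetical sketch of a selective historical embedding cache.

    An embedding is reused only if (a) it was refreshed within the last
    `max_staleness` iterations and (b) its recent gradient magnitude is
    below `grad_threshold`; everything else is re-computed from raw features.
    """

    def __init__(self, num_nodes, dim, grad_threshold=1e-3, max_staleness=10):
        self.embeddings = torch.zeros(num_nodes, dim)            # cached node embeddings
        self.last_update = torch.full((num_nodes,), -1)          # iteration of last refresh
        self.grad_norm = torch.full((num_nodes,), float("inf"))  # recent gradient magnitude
        self.grad_threshold = grad_threshold
        self.max_staleness = max_staleness

    def reusable(self, node_ids, iteration):
        """Boolean mask: True where the cached embedding is stable enough to reuse."""
        fresh = (iteration - self.last_update[node_ids]) <= self.max_staleness
        stable = self.grad_norm[node_ids] <= self.grad_threshold
        return fresh & stable

    def update(self, node_ids, new_embeddings, grad_norms, iteration):
        """Store freshly computed embeddings and their gradient magnitudes."""
        self.embeddings[node_ids] = new_embeddings.detach()
        self.grad_norm[node_ids] = grad_norms
        self.last_update[node_ids] = iteration
```

In a training loop, the mask returned by `reusable` would select which nodes read their embeddings from the cache and which trigger a raw-feature fetch and re-computation, after which `update` refreshes the cache for the re-computed nodes.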