Real-world node embedding applications often contain hundreds of billions of edges with high-dimensional node features. Scaling node embedding systems to efficiently support these applications remains a challenging problem. In this paper we present a high-performance multi-GPU node embedding system. It uses model parallelism to split node embeddings across each GPU's local parameter server, and data parallelism to train these embeddings on different edge samples in parallel. We propose a hierarchical data partitioning strategy and an embedding training pipeline to optimize both communication and memory usage on a GPU cluster. With the decoupled design of CPU tasks (random walk) and GPU tasks (embedding training), our system is highly flexible and can fully utilize all computing resources on a GPU cluster. Compared with the current state-of-the-art single-machine multi-GPU node embedding system, our system achieves a 5.9x-14.4x speedup on average with competitive or better accuracy on open datasets. Using 40 NVIDIA V100 GPUs on a network with almost three hundred billion edges and more than one billion nodes, our implementation requires only 3 minutes to finish one training epoch.
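The combination of model and data parallelism described above can be illustrated with a minimal sketch. This is not the paper's implementation: all names (`owner`, `lookup`, `sgd_step`) and the hash-based sharding rule are illustrative assumptions, and plain NumPy arrays stand in for per-GPU parameter-server shards.

```python
# Illustrative sketch only, not the paper's API. Model parallelism:
# the embedding table is sharded across GPUs by node ID. Data
# parallelism: each worker trains on its own batch of edge samples.
import numpy as np

NUM_GPUS = 4   # workers, each holding one embedding shard (assumed)
DIM = 8        # embedding dimension (assumed)

rng = np.random.default_rng(0)

# Each "GPU" keeps a local parameter server: here, a dict shard.
shards = [{} for _ in range(NUM_GPUS)]

def owner(node_id: int) -> int:
    """Model parallelism: each node's embedding lives on one shard."""
    return node_id % NUM_GPUS

def lookup(node_id: int) -> np.ndarray:
    """Fetch (lazily initializing) a node's embedding from its shard."""
    shard = shards[owner(node_id)]
    if node_id not in shard:
        shard[node_id] = rng.normal(0.0, 0.1, DIM)
    return shard[node_id]

def sgd_step(src: int, dst: int, lr: float = 0.05) -> None:
    """One skip-gram-style positive update on a sampled edge."""
    u, v = lookup(src), lookup(dst)
    score = 1.0 / (1.0 + np.exp(-u @ v))    # sigmoid(u . v)
    grad = 1.0 - score                      # push score toward 1
    shards[owner(src)][src] = u + lr * grad * v
    shards[owner(dst)][dst] = v + lr * grad * u

# Data parallelism: different workers would each run sgd_step on
# their own edge batch; here a few samples are applied serially.
for s, d in [(0, 5), (1, 6), (0, 5), (2, 7)]:
    sgd_step(s, d)

print(sorted(len(sh) for sh in shards))  # shard sizes after training
```

In a real system the shards would be GPU-resident tensors and cross-shard lookups would go over NVLink or the network, which is why the paper's hierarchical partitioning and pipelining matter.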