Graph embedding maps graph nodes to low-dimensional vectors and is widely adopted in machine learning tasks. The increasing availability of billion-edge graphs underscores the importance of learning efficient and effective embeddings at this scale, e.g., for link prediction on the Twitter graph with over one billion edges. Most existing graph embedding methods fall short of such data scalability. In this paper, we present DistGER, a general-purpose, distributed, information-centric, random-walk-based graph embedding framework that scales to billion-edge graphs. DistGER incrementally computes information-centric random walks. It further leverages a multi-proximity-aware, streaming, parallel graph-partitioning strategy that simultaneously achieves high local partition quality and excellent workload balancing across machines. DistGER also improves the distributed Skip-Gram learning model used to generate node embeddings by optimizing access locality, CPU throughput, and synchronization efficiency. Experiments on real-world graphs show that, compared to state-of-the-art distributed graph embedding frameworks including KnightKing, DistDGL, and PyTorch-BigGraph, DistGER achieves 2.33x-129x speedup, a 45% reduction in cross-machine communication, and more than 10% higher effectiveness on downstream tasks.
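As background, the following is a minimal single-machine sketch of the random-walk-plus-Skip-Gram paradigm that frameworks like DistGER build on. It is not DistGER's implementation (whose walks are information-centric and whose training is distributed); the toy graph, walk parameters, and gensim-based Skip-Gram training are illustrative assumptions.

```python
# Minimal single-machine sketch of random-walk-based node embedding.
# Illustrative only: plain truncated uniform walks + gensim Skip-Gram,
# not DistGER's information-centric, distributed pipeline.
import random
from collections import defaultdict
from gensim.models import Word2Vec  # assumes gensim >= 4.0

def random_walks(edges, num_walks=10, walk_len=40, seed=0):
    """Generate truncated uniform random walks starting from every node."""
    rng = random.Random(seed)
    adj = defaultdict(list)
    for u, v in edges:                 # treat the toy graph as undirected
        adj[u].append(v)
        adj[v].append(u)
    walks = []
    for _ in range(num_walks):
        for start in list(adj):
            walk = [start]
            while len(walk) < walk_len and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append([str(n) for n in walk])  # Word2Vec expects tokens
    return walks

edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]  # hypothetical toy graph
walks = random_walks(edges)
# sg=1 selects the Skip-Gram model; each walk is treated as a "sentence".
model = Word2Vec(walks, vector_size=64, window=5, min_count=1, sg=1, workers=4)
print(model.wv["0"])  # 64-dimensional embedding of node 0
```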