天河刑前系统自定义图500 (Customizing Graph500 for Tianhe Pre-exacale system)

BFS (Breadth-First Search) is a typical graph algorithm used as a key component of many graph applications. However, current distributed parallel BFS implementations suffer from irregular data communication with large volumes of transfers across nodes, leading to inefficiency in performance. In this paper, we present a set of optimization techniques to improve the Graph500 performance for Pre-exacale system, including BFS accelerating with SVE (Scalable Vector extension) in matrix2000+, sorting with buffering for heavy vertices, and group-based monitor communication based on proprietary interconnection built in Tianhe Pre-exacale system. Performance evaluation on the customized Graph500 testing on the Tianhe Pre-exacale system achieves 2131.98 Giga TEPS on 512-node with 96608 cores, which surpasses the ranking of Tianhe-2 with about 16X fewer nodes in the June 2018 Graph500 list, and shows our customized Graph500 is 3.15 times faster on 512 nodes than the base version using the state-of-the-art techniques.

翻译：BFS(Breadth- First Search)是一种典型的图形算法,作为许多图形应用程序的关键组成部分。然而,目前分布的平行BFS实施过程存在不规则的数据通信,在节点之间有大量传输,导致性能低下。本文介绍了一套优化技术,以改善Exacale系统图500性能,包括BFS在矩阵2000年+中与SVE(可缩放矢量扩展)同步加速,对重脊椎进行缓冲排序,以及基于天河前排系统所建专有互连系统的小组监测通信。天河前排系统定制图500测试的性能评估在512诺德(96608核心)上实现了2131.98 Giga TEPS,该测试超过了Tianhe-2的排名,在2018年6月的图500列表中,其节点减少了约16X,显示我们定制的图500在使用最新技术的基版512个节点上速度为315倍。