High Bandwidth Memory (HBM) provides massive aggregate memory bandwidth by exposing multiple memory channels to the processing units. To achieve high performance, an accelerator built on an FPGA equipped with HBM (i.e., an FPGA-HBM platform) needs to scale its performance with the number of available memory channels. In this paper, we propose ScalaBFS, an accelerator for the Breadth-First Search (BFS) algorithm that builds multiple processing elements (PEs) to exploit the high bandwidth of HBM and improve efficiency. We implement a prototype of ScalaBFS and run BFS on both real-world and synthetic scale-free graphs on real hardware, a Xilinx Alveo U280 FPGA card. The experimental results show that ScalaBFS scales its performance almost linearly with the number of available memory pseudo channels (PCs) in the HBM2 subsystem of the U280. By using all 32 PCs and building 64 PEs on the U280, ScalaBFS achieves a performance of up to 19.7 GTEPS (Giga Traversed Edges Per Second). When running BFS on sparse real-world graphs, ScalaBFS achieves GTEPS equivalent to Gunrock running on the state-of-the-art Nvidia V100 GPU, which features 64-PC HBM2 (twice the memory bandwidth of the U280).
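To make the GTEPS metric concrete, the following is a minimal software sketch of a level-synchronous BFS that counts traversed edges and converts the count into GTEPS. It is not the ScalaBFS hardware design: the names `bfs_levels` and `gteps` are hypothetical, the graph is assumed to be an adjacency-list dictionary, and "traversed edges" is assumed here to mean every adjacency-list entry inspected during the search (other counting conventions exist).

```python
def bfs_levels(adj, root):
    """Level-synchronous BFS over an adjacency-list dict.

    Returns (level, traversed), where level maps each reached vertex to
    its BFS depth and traversed counts every adjacency entry inspected
    (an assumed counting convention, see the lead-in).
    """
    level = {root: 0}
    frontier = [root]
    traversed = 0
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                traversed += 1
                if v not in level:  # first visit fixes the BFS depth
                    level[v] = level[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier
    return level, traversed


def gteps(traversed_edges, seconds):
    """Giga Traversed Edges Per Second for one BFS run."""
    return traversed_edges / seconds / 1e9


if __name__ == "__main__":
    # Tiny illustrative graph, not one of the paper's datasets.
    adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
    level, traversed = bfs_levels(adj, 0)
    print(level, traversed)
```

In an accelerator, the inner loops over the frontier would be spread across processing elements and memory channels; the sketch only fixes what is being counted and how the count is normalized by run time.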