The continued growth in the processing power of FPGAs, coupled with high-bandwidth memory (HBM), makes systems like the Xilinx U280 credible platforms for linear solvers, which often dominate the run time of scientific and engineering applications. In this paper, we present Callipepla, an accelerator for a preconditioned conjugate gradient (CG) linear solver. FPGA acceleration of CG faces three challenges: (1) how to support an arbitrary problem and terminate acceleration processing on the fly, (2) how to coordinate long-vector data flow among processing modules, and (3) how to save off-chip memory bandwidth while maintaining double-precision (FP64) accuracy. To tackle these challenges, we present (1) a stream-centric instruction set for efficient streaming processing and control, (2) vector streaming reuse (VSR) and decentralized vector flow scheduling, which coordinate vector data flow among modules and further reduce off-chip memory accesses with a double memory channel design, and (3) a mixed-precision scheme that saves bandwidth yet still achieves solutions of effectively double-precision quality. To the best of our knowledge, this is the first work to introduce the concept of VSR for data reuse between on-chip modules to reduce unnecessary off-chip accesses for FPGA accelerators. We prototype the accelerator on a Xilinx U280 HBM FPGA. Our evaluation shows that, compared to the Xilinx HPC product, the XcgSolver, Callipepla achieves a speedup of 3.94x, 3.36x higher throughput, and 2.94x better energy efficiency. Compared to an NVIDIA A100 GPU, which has 4x the memory bandwidth of Callipepla, we still achieve 77% of its throughput with 3.34x higher energy efficiency. The code is available at https://github.com/UCLA-VAST/Callipepla.
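As a software-only illustration of the solver being accelerated, the following is a minimal sketch of a preconditioned CG loop with a mixed-precision matrix. It rests on several assumptions that the abstract does not state: the preconditioner is taken to be Jacobi (diagonal), the matrix values are rounded to FP32 while all vectors and scalars stay in FP64, and termination is decided by checking the squared residual norm at every iteration. The names `to_fp32`, `spmv`, `dot`, and `jacobi_pcg` are hypothetical helpers and are not taken from the Callipepla code base.

```python
import struct


def to_fp32(x):
    """Round an FP64 Python float to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def spmv(rows, x):
    """Compute y = A x, with A stored as rows of (column, value) pairs."""
    return [sum(v * x[j] for j, v in row) for row in rows]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def jacobi_pcg(rows, b, tol=1e-20, max_iter=1000):
    """Jacobi-preconditioned CG (assumed preconditioner).

    Stops as soon as the squared residual norm drops to tol or below,
    so the iteration count is not known in advance.
    """
    diag = [next(v for j, v in row if j == i) for i, row in enumerate(rows)]
    x = [0.0] * len(b)
    r = list(b)                                   # r = b - A x with x = 0
    z = [ri / di for ri, di in zip(r, diag)]      # z = M^{-1} r
    p = list(z)
    rz = dot(r, z)
    for it in range(max_iter):
        if dot(r, r) <= tol:                      # on-the-fly termination check
            return x, it
        ap = spmv(rows, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, ap)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Small symmetric positive definite test matrix (tridiagonal).
rows64 = [
    [(0, 4.1), (1, -1.3)],
    [(0, -1.3), (1, 4.1), (2, -1.3)],
    [(1, -1.3), (2, 4.1)],
]
# Assumed mixed-precision scheme: FP32 matrix values, FP64 vectors.
rows32 = [[(j, to_fp32(v)) for j, v in row] for row in rows64]
b = [1.0, 2.0, 3.0]

x32, iters32 = jacobi_pcg(rows32, b)
x64, iters64 = jacobi_pcg(rows64, b)
```

In this sketch the FP32 copy of the matrix halves the bytes read per nonzero value, while the vectors that carry the iteration state keep full FP64 precision. The two solutions `x32` and `x64` differ only by the effect of rounding the matrix entries.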